Event History Models: Why R? Why SabreR? Rob Crouchley
CollaboratoryForQuantitativeE-SocialScience
Event History Models: Why R? Why SabreR?
Rob Crouchley
Contents
• Some science
• Performance of the available tools for multilevel models
• Breaking the technological barrier to adoption (sabreR)
• Demo
• Performance of parallel sabreR
• Conclusions
Some Science: BHPS Data (small dataset)
• Sample of males who were employed and earning a wage at some point over the period 1991-2003 (13 years)
• Gives a total of 5130 individuals with a sequence of responses that occurred somewhere in the 1991-2003 interval
• At the first sample point of the survey (1991) there were 2316 individuals, of whom 945 had received some form of training in the previous 12 months and 106 had been promoted in the previous 12 months
• The mean of the log of their weekly wage (in pounds sterling) was 5.65
What is the Effect of Training & Promotion on Wages?
• Suppose we want to disentangle the dependencies between:
  • Promotion (P=1,0) in the last 12 months (latent var P*)
  • On the job training (T=1,0) in the last 12 months (latent var T*)
  • Current wages (W)
[Path diagram with nodes ep, P*, P; et, T*, T; ew, W]
Correlated Random Effects Model
[Path diagram with nodes up, P*, P; ut, T*, T; uw, W; and residuals e*p, e*t, e*w]
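One way to write down the structure sketched in the diagram, offered as a hedged reading rather than the exact Sabre specification. The covariate vectors x, the coefficient names (beta, gamma) and the joint normal distribution of the random effects are assumptions; the slide itself only shows the nodes.

```latex
\begin{align*}
P^*_{it} &= x_{it}'\beta_p + u_{pi} + e^*_{pit}, & P_{it} &= \mathbf{1}[P^*_{it} > 0],\\
T^*_{it} &= x_{it}'\beta_t + u_{ti} + e^*_{tit}, & T_{it} &= \mathbf{1}[T^*_{it} > 0],\\
W_{it}   &= x_{it}'\beta_w + \gamma_P P_{it} + \gamma_T T_{it} + u_{wi} + e^*_{wit},\\
(u_{pi}, u_{ti}, u_{wi})' &\sim N(0, \Sigma).
\end{align*}
```

Under this reading, a diagonal \Sigma would correspond to independent random effects, and a full \Sigma to correlated (dependent) random effects.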
Commercial Software for MGLMMs
• Stata: http://www.stata.com/ Standard and adaptive quadrature, Newton-Raphson. See also Stata MP
• SAS PROC NLMIXED: http://www.sas.com/ Standard and adaptive quadrature and Taylor/Laplace expansions, quasi-Newton. See also SAS PROC MPCONNECT and SAS Grid computing
• Limdep: http://www.limdep.com/ Quadrature, quasi-Newton
MGLMMs: Other Systems
• MLwiN: http://www.cmm.bristol.ac.uk/ Laplace approximation and IRLS (also MCMC)
• Gllamm (Stata prog): http://www.gllamm.org/ Standard and adaptive quadrature, Newton-Raphson
• aML: http://www.applied-ml.com/
Packages at http://cran.r-project.org/ for GLMMs and MGLMMs
• lmer (in the lme4 package, http://cran.r-project.org/web/packages/lme4/index.html) Laplace approximation, penalized iteratively reweighted least squares
• npmlreg (http://cran.r-project.org/web/packages/npmlreg/index.html) Quadrature and NPML, EM algorithm
Why Quadrature?
• PQL (penalized quasi-likelihood): parameter estimates tend to be biased for binary dependent variables with small cluster sizes and high intraclass correlations (e.g. Rodriguez and Goldman, 1995, 2001)
• PQL: does not involve a likelihood, which rules out likelihood-based inference
• Laplace approximation: the 6th order expansion (Raudenbush et al., 2000) worked as well as 7-point AQ in simulations of a two-level binary dependent variable model
• The precision of Gaussian quadrature (GQ) and adaptive quadrature (AQ) can be increased simply by using more quadrature points
• We cannot increase the degree of the Taylor or Laplace expansion beyond the 2, 4 or 6 terms allowed for
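The point about precision growing with the number of quadrature points can be illustrated with a minimal base-R sketch. It is not Sabre's implementation: `gauss_hermite` and `cluster_lik` are helper names made up here, the data for the single cluster are invented, and the model (a random-intercept logit) is just a convenient example.

```r
# Gauss-Hermite nodes and weights via the Golub-Welsch eigenvalue method.
# (Hypothetical helper written for this sketch; not a sabreR function.)
gauss_hermite <- function(n) {
  if (n == 1) return(list(nodes = 0, weights = sqrt(pi)))
  k <- seq_len(n - 1)
  J <- matrix(0, n, n)
  J[cbind(k, k + 1)] <- sqrt(k / 2)   # off-diagonals of the Jacobi matrix
  J[cbind(k + 1, k)] <- sqrt(k / 2)
  e <- eigen(J, symmetric = TRUE)
  ord <- order(e$values)
  list(nodes = e$values[ord],
       weights = sqrt(pi) * e$vectors[1, ord]^2)
}

# Marginal likelihood of one cluster in a random-intercept logit model:
# integrate the conditional likelihood over u ~ N(0, sigma^2).
cluster_lik <- function(y, eta, sigma, n_points) {
  gh <- gauss_hermite(n_points)
  vals <- vapply(gh$nodes, function(x) {
    p <- plogis(eta + sqrt(2) * sigma * x)  # change of variable u = sqrt(2)*sigma*x
    prod(p^y * (1 - p)^(1 - y))
  }, numeric(1))
  sum(gh$weights * vals) / sqrt(pi)
}

# Invented example cluster: four binary responses and their linear predictors.
y <- c(1, 1, 0, 1)
eta <- c(0.5, -0.2, 0.1, 0.3)
sigma <- 1.5

# Independent numerical reference from base R's adaptive integrator.
reference <- integrate(function(u) {
  vapply(u, function(ui) {
    p <- plogis(eta + ui)
    prod(p^y * (1 - p)^(1 - y)) * dnorm(ui, sd = sigma)
  }, numeric(1))
}, -Inf, Inf)$value

n_points <- c(2, 4, 8, 20)
gq_values <- vapply(n_points, function(n) cluster_lik(y, eta, sigma, n), numeric(1))
names(gq_values) <- n_points
print(data.frame(points = n_points, value = gq_values,
                 abs_error = abs(gq_values - reference)))
```

The printed errors show how the approximation tightens as points are added; a fixed-order Taylor or Laplace expansion has no comparable dial.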
Simulation Based Methods
• Computer intensive alternatives to GQ and AQ include simulation based approaches such as Markov Chain Monte Carlo (MCMC) (e.g. Gelman et al., 2003) and maximum simulated likelihood (MSL) (Hajivassiliou and Ruud, 1994)
• The hierarchical structure of multilevel models lends itself naturally to MCMC using, for instance, Gibbs sampling. If vague priors are specified, the method essentially yields maximum likelihood estimates
• Unfortunately, a problem with MCMC is how to ensure that a truly stationary distribution has been obtained for MGLMMs, especially when we have a lot of structural and incidental parameters
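As a rough illustration of the simulation idea behind MSL, the sketch below replaces the integral over the random effect with an average over random draws. It covers only the simulated-likelihood step for one invented cluster, with no maximization, and the names `cond_lik` and `sim_lik` are made up for this example.

```r
set.seed(1)

# Invented example cluster for a random-intercept logit model.
y <- c(1, 1, 0, 1)
eta <- c(0.5, -0.2, 0.1, 0.3)
sigma <- 1.5

# Conditional likelihood of the cluster given the random effect u.
cond_lik <- function(u) {
  p <- plogis(eta + u)
  prod(p^y * (1 - p)^(1 - y))
}

# Simulated likelihood: average the conditional likelihood over draws of u.
sim_lik <- function(n_draws) {
  u <- rnorm(n_draws, mean = 0, sd = sigma)
  mean(vapply(u, cond_lik, numeric(1)))
}

draws <- c(100, 1000, 100000)
estimates <- vapply(draws, sim_lik, numeric(1))

# Deterministic reference from base R's adaptive integrator.
reference <- integrate(function(u) {
  vapply(u, cond_lik, numeric(1)) * dnorm(u, sd = sigma)
}, -Inf, Inf)$value

print(data.frame(draws = draws, estimate = estimates,
                 abs_error = abs(estimates - reference)))
```

The simulation error shrinks only at the usual Monte Carlo rate, which is part of why these approaches are described as computer intensive.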
In tests, serial Sabre outperforms other software

Example     Data       Obs    Cases  Vars  Size (tab)  Method         Stata   gllamm    SAS      npmlreg  lmer   Sabre 1
univariate  Wages (W)  31022  5285   74    17.1MB      AQ (12)        15"     22h23'    3'26"    1h28'    2'03"  1'05"
univariate  Train (T)  31022  5285   71    17.1MB      AQ (16)        11'51"  25h32'    7+days   44'39"   5'51"  50"
univariate  Prom (P)   31022  5285   72    17.1MB      AQ (16)        15'08"  25h32'    7+days   58'37"   4'37"  52"
bivariate   T & P      62044  5285   143   34.2MB      AQ (16x16)     na      150+days  30+days  na       nd     1h42'
trivariate  W & T & P  93066  5285   217   51.3MB      AQ (12x16x16)  na      15+yrs    1+yrs    na       nd     115h45'

Notes:
• lmer: GQ and AQ not yet implemented; REML and ML give the Laplace approximation answer
• npmlreg: times are for GQ, as AQ is not available
• Sabre was compiled with the Portland Group PGF90 7.1-6 compiler with -fast (level 2 optimization)
• Times are system times (very close to real time in all figures), with very little variation between runs
• R and gllamm are interpreted code; SAS?
Other Sabre comparisons (very small to small data sets):
• MLwiN (MCMC, IGLS) is 2-25 times slower in univariate 2-level models
• For others see the Sabre site http://sabre.lancs.ac.uk/
Changes in Substantive Findings Between Models
Covariate coefficients in the wage equation, by model

                  Homog        Indep        Dep
Promo             0.09499      0.06103      0.05288
                 (0.00824)    (0.00599)    (0.00611)
Train            -0.00683     -0.00865     -0.00864
                 (0.00526)    (0.00396)    (0.00405)
Log-likelihood  -38471.93    -29448.19    -29419.52
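The log-likelihoods in the table can be compared with a likelihood ratio calculation, sketched here in base R. The degrees of freedom are an assumption (three pairwise correlations between three random effects), since the slide does not give the parameter counts, and the chi-square reference distribution is only the usual approximation.

```r
# Log-likelihoods as reported on the slide.
loglik <- c(homog = -38471.93, indep = -29448.19, dep = -29419.52)

# Likelihood ratio statistics for the two nested comparisons.
lr_indep_vs_homog <- 2 * (loglik[["indep"]] - loglik[["homog"]])
lr_dep_vs_indep   <- 2 * (loglik[["dep"]] - loglik[["indep"]])

# ASSUMPTION: moving from independent to dependent random effects adds
# three correlation parameters; the slide does not state this.
df_assumed <- 3
p_value <- pchisq(lr_dep_vs_indep, df = df_assumed, lower.tail = FALSE)

print(c(lr_indep_vs_homog = lr_indep_vs_homog,
        lr_dep_vs_indep = lr_dep_vs_indep,
        p_value = p_value))
```

Both steps improve the fit substantially, while the estimated promotion coefficient falls from 0.09499 to 0.05288 across the three models.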
Breaking the technological barrier to adoption
Previously
• 2X harder to use the NGS than to use your local HPC (private computing facility)
Now
• It is easier to use the NGS (public computing facility) than it is to use your local HPC
Enabling Technology for grid computing
All you need is:
1. An internet connection
2. The installation of our multiR or sabreR packages for R
3. A certificate to identify the client to the host (typically a grid certificate)
Also
• Users do not need to install or be familiar with Globus, VDT, gsissh, gsiscp, grid-ftp, grid-proxy tools or any other grid-related software.
• There is very little difference between using the Sabre library from within R on the desktop, and using Sabre for statistical modelling on the grid from within R.
Desktop Vs Grid on the Windows desktop
Serial sabreR
# load the sabreR package
library(sabreR)
sabre.model.1 <- sabre(proximity ~ factor(time) - 1, case = teacher,
                       first.family = "gaussian", first.mass = 64, first.scale = 0.5)
# display results
sabre.model.1

Parallel sabreR
# load previously saved grid session object
load(file = "ncess.demo.session.R")
sabre.model.2 <- sabre(proximity ~ factor(time) - 1, case = teacher,
                       first.family = "gaussian", first.mass = 64, first.scale = 0.5,
                       session = ncess.demo.session,
                       description = "here ya go !!")
# recover the results and display them
sabre.results(ncess.demo.session, sabre.model.2)
Demo
• rob_sabrer_edit2.mov
Master-Slave (Distributed Memory) Model for MPI as used by Sabre on the NW-Grid
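[Diagram: the cases are split across four slave processes, each of which returns its contributions to the master process]
• Slave a: Li, Hi, di for cases i = 1,...,1000
• Slave b: Li, Hi, di for cases i = 1001,...,2000
• Slave c: Li, Hi, di for cases i = 2001,...,3000
• Slave d: Li, Hi, di for cases i = 3001,...,4000
• Master process: sums a+b+c+d for L, H and d, etc., then takes a Newton-Raphson (NR) step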
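The scheme in the diagram relies on per-case contributions being additive across blocks of cases. The base-R sketch below illustrates that idea on a simulated logistic regression. It is not Sabre's code and uses no MPI: `lapply` stands in for the slave processes, `block_contrib` is a made-up helper, and reading L, d and H as log-likelihood, gradient and Hessian is an assumption about the slide's notation.

```r
set.seed(42)

# Simulated data: 4000 cases, intercept plus one covariate.
n <- 4000
X <- cbind(1, rnorm(n))
beta_true <- c(-0.5, 1)
y <- rbinom(n, 1, as.vector(plogis(X %*% beta_true)))

# Split the cases into four blocks, mirroring i = 1..1000, 1001..2000, ...
blocks <- split(seq_len(n), rep(1:4, each = n / 4))

# Work done by one "slave": contributions from its own cases only.
block_contrib <- function(idx, beta) {
  Xb <- X[idx, , drop = FALSE]
  p <- as.vector(plogis(Xb %*% beta))
  list(L = sum(dbinom(y[idx], 1, p, log = TRUE)),   # log-likelihood
       d = as.vector(crossprod(Xb, y[idx] - p)),    # gradient
       H = -crossprod(Xb * (p * (1 - p)), Xb))      # Hessian
}

# Work done by the "master": sum the blocks, then take a Newton-Raphson step.
beta <- c(0, 0)
for (iter in 1:25) {
  parts <- lapply(blocks, block_contrib, beta = beta)  # stand-in for MPI slaves
  L <- sum(vapply(parts, function(p) p$L, numeric(1)))
  d <- Reduce(`+`, lapply(parts, function(p) p$d))
  H <- Reduce(`+`, lapply(parts, function(p) p$H))
  step <- solve(H, d)
  beta <- beta - step
  if (max(abs(step)) < 1e-10) break
}

# Check against R's own logistic regression fit.
glm_beta <- unname(coef(glm(y ~ X[, 2], family = binomial)))
print(rbind(blockwise_nr = beta, glm = glm_beta))
```

Only the small L, d and H objects travel between processes in such a design, while each block of data stays where it is, which suits distributed memory.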
There is no commercial software on the NGS or NW-GRID (licensing and cost issues)
Performance of Parallel Sabre
Relative performance of parallel Sabre compared to serial Sabre (=100) on example datasets
[Bar chart: performance (0.0 to 100.0) by number of processors for each dataset; values tabulated below]

Dataset             Serial  2 proc  4 proc  8 proc
L7 - filled         100.0   54.4    31.9    20.3
L7 - lapsed         100.0   55.0    32.8    21.4
L8 - filled-lapsed  100.0   58.9    34.5    22.0
L9 - Union wage     100.0   50.2    25.5    13.3

In the Wage example, 5 days becomes 2.75 hours on 48 processors
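The figures above can be turned into speedup and parallel efficiency with a few lines of base R. This sketch assumes the "performance" values are run times relative to the serial run (=100), which is how the slide appears to use them.

```r
# Relative run times from the slide (serial = 100).
rel_time <- rbind(
  "L7 - filled"        = c(100.0, 54.4, 31.9, 20.3),
  "L7 - lapsed"        = c(100.0, 55.0, 32.8, 21.4),
  "L8 - filled-lapsed" = c(100.0, 58.9, 34.5, 22.0),
  "L9 - Union wage"    = c(100.0, 50.2, 25.5, 13.3)
)
colnames(rel_time) <- c("Serial", "2 proc", "4 proc", "8 proc")
procs <- c(1, 2, 4, 8)

# Speedup = serial time / parallel time; efficiency = speedup / processors.
speedup <- 100 / rel_time
efficiency <- sweep(speedup, 2, procs, "/")

print(round(speedup, 2))
print(round(efficiency, 2))

# Wage example: 5 days down to 2.75 hours on 48 processors.
wage_speedup <- (5 * 24) / 2.75
wage_efficiency <- wage_speedup / 48
print(c(speedup = wage_speedup, efficiency = wage_efficiency))
```

On this reading the Union wage dataset scales best, staying close to linear speedup up to 8 processors.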
Why R?
• Commercial tools (Stata, SAS) are of limited use on a public grid, e.g. Stata MP cannot have multiple data sets in memory and neither system provides access to its source code
• There are no plans to install them on the UK National Grid Service (NGS) because of cost/licensing issues
• R is an effective, efficient and easy to use tool for statistical modelling
• Many existing tried and tested statistical methods already available for R can easily be modified to exploit the benefits of grid computing
• Work flows to support the modelling process are simple to create
• R is easy to install on most popular operating systems (Windows, Unix, OS X) and can be used directly from a USB memory stick
• R includes a programming environment which, when used in conjunction with our multiR and sabreR packages, automatically provides a data-centric scripting tool for grid computing
• There are no licensing issues
Conclusions
• This approach makes all the grid middleware invisible and thus removes the biggest barrier to take-up
• This approach can provide researchers with more sophisticated statistical modelling tools, increase their understanding of complex processes, and so help them to undertake more effective research
• Social researchers do not need to let their large-scale science agenda using GLMs be set by the developments of the big statistics software houses such as SAS and Stata