A Flexible Model for Count Data: The COM Poisson
Galit Shmueli
Indian School of Business
Minka (Microsoft), Borle (Rice U), Sellers (Georgetown),
Boatwright & Kadane (CMU), Bose, Sur & Dubey (ISI)
Non-Poisson data used to be exotic
Bliss & Fisher (1953)
European female red mites on apple leaves.
Bacterial clumps in milk drops.
Number of lice on heads of Hindu male prisoners in Cannamore, South India, 1937-39.
Today non-Poisson counts are common
Email traffic
Visits to websites
Calls to service centers
Online transactions
Bids in online auctions
Messages via online dating sites
Tweets, Facebook posts, blog comments
Conway-Maxwell-Poisson
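$$P(Y = y) = \frac{\lambda^y}{(y!)^\nu \, Z(\lambda, \nu)}, \qquad y = 0, 1, 2, \ldots$$
$$Z(\lambda, \nu) = \sum_{j=0}^{\infty} \frac{\lambda^j}{(j!)^\nu}, \qquad \lambda > 0, \; \nu \ge 0$$
Shmueli et al. (JRSS C, 2005) A Useful Distribution for Fitting Discrete Data:
Revival of the CMP Distribution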
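The normalizing constant Z(λ, ν) is an infinite series, so any numerical evaluation of the pmf has to cut it off somewhere. The following is a minimal Python sketch of that idea. The function names `cmp_log_z` and `cmp_pmf`, the tolerance, and the cap on the number of terms are illustrative assumptions; they are not taken from the talk or from the compoisson R package.

```python
import math

def cmp_log_z(lam, nu, max_terms=10_000, tol=1e-12):
    """Approximate log Z(lam, nu) by truncating sum_j lam^j / (j!)^nu."""
    if lam <= 0 or nu < 0:
        raise ValueError("need lam > 0 and nu >= 0")
    if nu == 0 and lam >= 1:
        raise ValueError("the series diverges for nu = 0 and lam >= 1")
    log_terms = []
    peak = -math.inf
    for j in range(max_terms):
        # log of the j-th summand; lgamma(j + 1) equals log(j!)
        t = j * math.log(lam) - nu * math.lgamma(j + 1)
        log_terms.append(t)
        peak = max(peak, t)
        # heuristic cutoff: stop once a summand is negligible next to the largest one
        if t < peak + math.log(tol):
            break
    else:
        raise RuntimeError("series did not settle within max_terms")
    # log-sum-exp keeps the sum stable when summands are very large or very small
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))

def cmp_pmf(y, lam, nu):
    """P(Y = y) for the CMP distribution with parameters lam and nu."""
    return math.exp(y * math.log(lam) - nu * math.lgamma(y + 1) - cmp_log_z(lam, nu))

# With nu = 1 the two printed values should agree (CMP reduces to Poisson).
print(cmp_pmf(3, 2.0, 1.0), math.exp(-2.0) * 2.0**3 / math.factorial(3))
```

Working on the log scale is a design choice here: for small ν the summands can grow very large before they decay, and summing them directly would overflow.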
Generalizes well-known distributions
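Poisson ($\nu = 1$)
Bernoulli ($\nu \to \infty$)
Geometric ($\nu = 0$, $\lambda < 1$)
$$P(Y = y) = \frac{\lambda^y}{(y!)^\nu \, Z(\lambda, \nu)}, \qquad Z(\lambda, \nu) = \sum_{j=0}^{\infty} \frac{\lambda^j}{(j!)^\nu}$$
$$\frac{P(Y = y - 1)}{P(Y = y)} = \frac{y^\nu}{\lambda}$$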
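The three special cases and the ratio of consecutive probabilities can be checked numerically. The sketch below assumes a fixed truncation of Z at 500 terms and uses ν = 50 as a stand-in for the limit ν → ∞; the function name `cmp_pmf` and the parameter values are illustrative choices only.

```python
import math

def cmp_pmf(y, lam, nu, terms=500):
    """CMP pmf with Z truncated after `terms` summands (an assumed cutoff)."""
    log_w = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    peak = max(log_w)
    log_z = peak + math.log(sum(math.exp(w - peak) for w in log_w))
    return math.exp(log_w[y] - log_z)

lam = 0.6
checks = {
    # nu = 1: Poisson(lam)
    "poisson": (cmp_pmf(3, lam, 1.0), math.exp(-lam) * lam**3 / math.factorial(3)),
    # nu = 0 and lam < 1: geometric with success probability 1 - lam
    "geometric": (cmp_pmf(3, lam, 0.0), (1 - lam) * lam**3),
    # large nu: Bernoulli with P(Y = 1) = lam / (1 + lam)
    "bernoulli": (cmp_pmf(1, lam, 50.0), lam / (1 + lam)),
}
for name, (cmp_value, reference) in checks.items():
    print(name, cmp_value, reference)

# Ratio of consecutive probabilities: P(Y = y - 1) / P(Y = y) = y^nu / lam
y, nu = 4, 1.7
print(cmp_pmf(y - 1, lam, nu) / cmp_pmf(y, lam, nu), y**nu / lam)
```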
Properties: Exponential Family
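$$\log L(\lambda, \nu; y_1, \ldots, y_n) = \log\lambda \sum_{i=1}^{n} y_i - \nu \sum_{i=1}^{n} \log(y_i!) - n \log Z(\lambda, \nu)$$
Truncation / Approximation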
Minka, Shmueli, Kadane, Borle & Boatwright (Tech Report, CMU Dept of Stat, 2003)
Computing with the COM-Poisson Distribution
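Because the CMP is an exponential family, the log-likelihood depends on the data only through the two sums Σ y_i and Σ log(y_i!). The sketch below makes that explicit. The names `log_z` and `cmp_loglik`, the fixed 200-term truncation, and the toy sample are assumptions made for illustration.

```python
import math

def log_z(lam, nu, terms=200):
    """log Z(lam, nu), truncating the series after `terms` summands (assumed cutoff)."""
    log_w = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    peak = max(log_w)
    return peak + math.log(sum(math.exp(w - peak) for w in log_w))

def cmp_loglik(ys, lam, nu):
    """CMP log-likelihood written in terms of its sufficient statistics."""
    n = len(ys)
    s1 = sum(ys)                                # sum of y_i
    s2 = sum(math.lgamma(y + 1) for y in ys)    # sum of log(y_i!)
    return s1 * math.log(lam) - nu * s2 - n * log_z(lam, nu)

ys = [0, 1, 1, 2, 4]  # hypothetical toy sample
print(cmp_loglik(ys, 1.6, 1.0))
```

At ν = 1 this should coincide with the ordinary Poisson log-likelihood, which is a convenient sanity check before handing the function to a numerical optimizer.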
Properties: Moments
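$$E(Y) = \frac{\partial \log Z(\lambda, \nu)}{\partial \log\lambda} \approx \lambda^{1/\nu} - \frac{\nu - 1}{2\nu}$$
$$\mathrm{Var}(Y) = \frac{\partial E(Y)}{\partial \log\lambda} \approx \frac{E(Y)}{\nu}$$
(approximations for $\nu \le 1$ or $\lambda > 10^\nu$)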
(Thanks to Ralph Snyder, Monash U)
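The moment relations can be illustrated numerically: the mean and variance are computed from a truncated series, the mean is compared with the approximation λ^{1/ν} − (ν − 1)/(2ν), and the identity Var(Y) = ∂E(Y)/∂ log λ is checked with a central finite difference. The function names, the 500-term truncation, the step size `h`, and the parameter values are all assumptions chosen for this sketch.

```python
import math

def cmp_mean_var(lam, nu, terms=500):
    """Mean and variance of CMP(lam, nu) from a truncated series (assumed cutoff)."""
    log_w = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    peak = max(log_w)
    w = [math.exp(t - peak) for t in log_w]   # unnormalized probabilities
    z = sum(w)
    mean = sum(j * wj for j, wj in enumerate(w)) / z
    second = sum(j * j * wj for j, wj in enumerate(w)) / z
    return mean, second - mean**2

def approx_mean(lam, nu):
    """Approximate mean: lam^(1/nu) - (nu - 1) / (2 nu)."""
    return lam ** (1.0 / nu) - (nu - 1.0) / (2.0 * nu)

lam, nu, h = 3.0, 0.5, 1e-4
mean, var = cmp_mean_var(lam, nu)
# central difference of E(Y) with respect to log(lam)
slope = (cmp_mean_var(lam * math.exp(h), nu)[0]
         - cmp_mean_var(lam * math.exp(-h), nu)[0]) / (2 * h)

print("mean:", mean, "approximation:", approx_mean(lam, nu))
print("variance:", var, "dE/dlog(lam):", slope)
```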
WLS: $\log(p_{y-1} / p_y) = -\log\lambda + \nu \log y$
ML
Bayes
Conjugate prior:
Kadane, Shmueli, Minka, Borle & Boatwright (Bayesian Analysis, 2006)
Conjugate Analysis of the Conway-Maxwell-Poisson Distribution
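The regression idea behind the WLS estimator can be sketched in a few lines: regress log(p_{y-1}/p_y) on log y, then read ν off the slope and λ off the intercept. The version below is a simplification that uses unweighted least squares (the weighting scheme of the WLS approach is not reproduced), and the function name `fit_ratio_regression` and the synthetic frequencies are assumptions for illustration.

```python
import math

def fit_ratio_regression(freq):
    """freq[y] is the observed count of value y. Returns (lam_hat, nu_hat).

    Unweighted least squares of log(freq[y-1] / freq[y]) on log(y);
    a simplified stand-in for the weighted version.
    """
    xs, zs = [], []
    for y in range(1, len(freq)):
        if freq[y - 1] > 0 and freq[y] > 0:   # skip empty cells, the log is undefined there
            xs.append(math.log(y))
            zs.append(math.log(freq[y - 1] / freq[y]))
    if len(xs) < 2:
        raise ValueError("need at least two usable ratios")
    x_bar = sum(xs) / len(xs)
    z_bar = sum(zs) / len(zs)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (z - z_bar) for x, z in zip(xs, zs)) / sxx
    intercept = z_bar - slope * x_bar
    # intercept = -log(lam), slope = nu
    return math.exp(-intercept), slope

# Synthetic frequencies exactly proportional to a CMP pmf with lam = 2, nu = 1.5
exact_freq = [2.0**y / math.factorial(y) ** 1.5 for y in range(9)]
print(fit_ratio_regression(exact_freq))
```

On real data the ratios for rare values are noisy, which is the motivation for weighting; estimates like these are typically used as starting values for ML.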
[Figure: Quarterly sales of socks ($\lambda = 0.97$, $\nu = 0.126$). Frequency vs. number of purchased items.]
[Figure: Word length in Hungarian dictionary ($\lambda = 7.74$, $\nu = 2.15$). Frequency vs. word length (number of syllables).]
Kadane, Krishnan, and Shmueli (Management Science, 2006)
A Data Disclosure Policy for Count Data Based on the COM-Poisson Distribution
Data Disclosure
Bose, Dubey, Shmueli, Sur (2012) Working paper
Modeling Bi-Modal Count Data Using COM-Poisson Mixture Models
EM Algorithm
Rating                        Data   Poisson Mixture   CMP Mixture
ice absent                      39                31            36
ice present somewhat low         9                42            34
neutral                         75                47            46
ice present somewhat high       52                45            47
ice present very high           24                35            36
Estimates
p                                             0.1453          0.11
$\lambda_1, \lambda_2$                   1.2, 3.9594    0.92, 4.98
$\nu_1, \nu_2$                                            4.6, 1.2
Log likelihood                             -373.4166        -335.7
AIC                                            819.6         681.4
BIC                                            829.5         697.9
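A two-component CMP mixture pmf is simply a weighted sum of two CMP pmfs. The sketch below shows how such a mixture can produce two modes. It does not attempt to reproduce the fitted values in the table: the parameter values, the function names, and the truncation of Z are hypothetical choices, and no EM fitting step is shown.

```python
import math

def cmp_pmf(y, lam, nu, terms=200):
    """CMP pmf with Z truncated after `terms` summands (an assumed cutoff)."""
    log_w = [j * math.log(lam) - nu * math.lgamma(j + 1) for j in range(terms)]
    peak = max(log_w)
    log_z = peak + math.log(sum(math.exp(w - peak) for w in log_w))
    return math.exp(log_w[y] - log_z)

def cmp_mixture_pmf(y, p, lam1, nu1, lam2, nu2):
    """Mixture: weight p on CMP(lam1, nu1) and 1 - p on CMP(lam2, nu2)."""
    return p * cmp_pmf(y, lam1, nu1) + (1 - p) * cmp_pmf(y, lam2, nu2)

# Hypothetical parameters: one component concentrated near 0, one near 7.
params = dict(p=0.4, lam1=0.5, nu1=3.0, lam2=50.0, nu2=2.0)
for y in range(12):
    prob = cmp_mixture_pmf(y, **params)
    print(y, round(prob, 4), "#" * int(100 * prob))
```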
Bayesian Implementation: Transportation
$\#\,\text{crashes}_i \sim \mathrm{CMP}(\lambda_i, \nu)$
$$\ln\left(\lambda_i^{1/\nu}\right) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$$
n=868, 2 predictors
Uninformative priors
MCMC: 35,000 replications
Our Approach: Classic GLM
generalize popular models
natural link function
exponential family: quick & elegant
estimation
diagnostics
inference
Sellers & Shmueli (Annals of Applied Statistics, 2010)
A Flexible Regression Model for Count Data
Link Function
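Poisson Regression ($\nu = 1$)
Logistic Regression ($\nu \to \infty$)
Geometric Regression ($\nu = 0$, $\lambda < 1$)
$$E(Y) = \frac{\partial \log Z(\lambda, \nu)}{\partial \log\lambda} \approx \lambda^{1/\nu} - \frac{\nu - 1}{2\nu}$$
$$\eta(E(y_i)) = \log(\lambda_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}$$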
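With the link log(λ_i) = β_0 + Σ β_j x_ij and a common dispersion ν, the CMP regression log-likelihood is a sum of per-observation CMP terms. The sketch below evaluates it for given coefficients; it is not the fitting routine of any package. The names `log_z` and `cmp_reg_loglik`, the truncation of Z, and the toy design matrix are assumptions for illustration.

```python
import math

def log_z(log_lam, nu, terms=300):
    """log Z for a given log(lam), truncated after `terms` summands (assumed cutoff)."""
    log_w = [j * log_lam - nu * math.lgamma(j + 1) for j in range(terms)]
    peak = max(log_w)
    return peak + math.log(sum(math.exp(w - peak) for w in log_w))

def cmp_reg_loglik(beta, nu, X, y):
    """Log-likelihood of a CMP regression with link log(lam_i) = beta_0 + sum_j beta_j x_ij."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        log_lam = beta[0] + sum(b * x for b, x in zip(beta[1:], x_i))
        total += y_i * log_lam - nu * math.lgamma(y_i + 1) - log_z(log_lam, nu)
    return total

# Hypothetical toy data: 3 observations, 2 predictors
X = [[0.0, 1.0], [1.0, 0.5], [2.0, -1.0]]
y = [1, 0, 3]
beta = [0.2, 0.3, -0.1]
print(cmp_reg_loglik(beta, 1.0, X, y))   # nu = 1 corresponds to Poisson regression
print(cmp_reg_loglik(beta, 0.5, X, y))   # nu < 1 allows over-dispersion
```

Maximizing this function over (β, ν) with a general-purpose optimizer would give ML estimates; at ν = 1 the value matches the Poisson regression log-likelihood.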
Alternative Regression Models
Negative Binomial Regression (over-dispersion)
Logistic regression (binary)
Linear regression of log Y
Restricted Generalized Poisson (Famoye, 1993)
Example 1: Diagnostics
Leverage ($H_{ii}$)
LinReg   Poisson   CMP
0.1      0.103     0.154
0.2      0.183     0.194
0.2      0.183     0.273
0.2      0.183     0.226
0.5      0.594     0.600
0.1      0.103     0.379
0.2      0.183     0.238
0.1      0.103     0.112
0.2      0.183     0.365
0.2      0.183     0.458
Example 2: Book Purchases
Direct marketing campaign of art book
n=1000 customers, p=2 predictors
Response: purchase / no purchase
Under-dispersion
Example 3: Motor Vehicle Crashes
868 intersections, 2 traffic predictors
Response = # annual crashes
Lord et al. (2008): Bayesian approach
Link: $\log(\lambda^{1/\nu})$
Uninformative priors
MCMC: 35,000 replications
Runtime: 5 hours
Detecting Dispersion Mixtures
Is the observed dispersion real or apparent?
Sellers and Shmueli, forthcoming (Communications in Statistics: Theory & Methods)
Data Dispersion: Now You See it... Now You Don't
CMP Regression has several advantages
Extendable to observation-level dispersion ($\nu_i$)
Exponential Family: Estimation, Inference, Diagnostics
Flexibility
Parsimony
Computational Efficiency
Generalizes popular Poisson regression and logistic regression
R Code: compoisson on CRAN www9.georgetown.edu/faculty/kfs7/research
Weaknesses:
• No “easy” closed form
• No direct interpretation of regression coefficients
• Some computational issues
Lots of room for further development and new applications, especially prediction
Methodological Extensions
Various Regression models
Control charts
Mixtures
Cure-rate models
The COM-Poisson Model for Count Data:
A Survey of Methods and Applications
Sellers, Borle and Shmueli
ASMBI, 2012 with discussion
Applications
Marketing
eCommerce
Transportation
Healthcare
Biology