Orthogonal Array Based Big Data Subsampling Hongquan Xuhqxu/stat201A/talk-bigdata2019.pdf · 2019....
Transcript of Orthogonal Array Based Big Data Subsampling Hongquan Xuhqxu/stat201A/talk-bigdata2019.pdf · 2019....
Orthogonal Array Based Big Data
Subsampling
Hongquan Xu
University of California, Los Angeles
Joint work with
Lin Wang, Jake Kramer and Weng Kee Wong
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 1 / 20
Motivation
Data are big (sometimes redundant)
Analyzing the full data may be computationally unfeasible
Storing all of the data may not be possible
There are a lot of circumstances under which X is big while the
label (or response) Y is expensive to obtain
Big X Small Y
Patients’ records Performance of
a new medicine
Images as visual Brain response
stimuli
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 2 / 20
Linear Regression Setup
y = β0 + β1x1 + · · ·+ βpxp + ε
Question:
1. If the budget only allows k responses (labels), k � n, choose
which k to label?
2. When all responses are available, to accelerate the computation,
how to choose a subsample of size k � n ?
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 3 / 20
Leverage Subsampling
A subsampling method consists of
sampling probabilities πi , i = 1, . . . , n,∑n
i πi = 1
a weighted estimator β̂s = (XTs WXs)−1XT
s WYs , where Xs is
the subsample taken from X and W = diag(w1, . . . ,wk) is a
weight matrix
1. Uniform sampling: πi = 1/n, wi = 1
2. Leveraging: πi = hii/(p + 1), wi = 1/πi , where
hii = (X (XTX )−1XT )ii
(Drineas et al., 2006; Ma et al., 2015)
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 4 / 20
Information-based Subsampling
The OLS for a subsample Xs is β̂s = (XTs Xs)−1XT
s Ys
E(β̂s) = β, Var(β̂s) = σ2(XTs Xs)−1
The Fisher information matrix for β with subdata is
Ms = XTs Xs
D-optimality: to find Xs with k points that maximizes det(Ms)
Available approach: IBOSS (Wang H., Yang, and Stufken, 2018 JASA)
Include data points with extreme (largest and smallest) covariate
values
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 5 / 20
Orthogonal Array-based Subsampling
Lemma
Suppose each covariate is scaled to [−1, 1]. For a subsample Xs of
size k ,
det(Ms) ≤ kp+1,
and the equality holds (D-optimal) if and only if Xs forms a
two-level OA with levels from {−1, 1} and strength t ≥ 2.
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 6 / 20
A Measure of “Similarity”
For x , y ∈ [−1, 1], define
δ(x , y) =
{2− (x2 + y2)/2, if sign(x) = sign(y);
1− (x2 + y2)/2, otherwise.
−1.0 −0.5 0.0 0.5 1.0
−1.
0−
0.5
0.0
0.5
1.0
x
y
{1 ≤ δ(x , y) ≤ 2, if sign(x) = sign(y);
0 ≤ δ(x , y) ≤ 1, otherwise.
δ(−1, 1) = δ(1,−1) = 0
δ(−1,−1) = δ(1, 1) = 1
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 7 / 20
J-optimality
For two data points xi = (xi1, . . . , xip) and xj = (xj1, . . . , xjp), let
δ(xi , xj) =
p∑l=1
δ(xil , xjl)
and define the J-optimality criterion (Xu 2002, Technometrics) as
J(Xs) =∑
1≤i<j≤k[δ(xi , xj)]2 .
Theorem
For any k-point subsample Xs over [−1, 1]p,
J(Xs) ≥ [k2p(p + 1)− 4kp2]/8,
with equality if and only if Xs forms an OA with strength 2.
Question: How to find Xs that minimizes J(Xs)?Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 8 / 20
Algorithm: Select and eliminate points simultaneously
Step 0. Given n × p matrix X , scale each variable to [-1,1].
Step 1. Find a point that is farthest to the origin. Include it as Xs and
let i = 1.
Step 2. For each x ∈ X , compute the J-score
J(x ,Xs) =∑xs∈Xs
δ(x , xs)2.
Step 3. Find x∗ ∈ X that minimizes J(x ,Xs) and add x∗ to Xs .
Step 4. Keep t = bn/ic points in X with t smallest J(x ,Xs) values.
Remove x∗ and other points from X .
Step 5. Increase i by 1 and repeat Steps 2–4 until Xs contains k points.
Note: The complexity is O(np log(k)) as∑k
i=1(1/i) = O(log(k)).
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 9 / 20
Algorithm
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
x[, 1]
x[, 2
]
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 10 / 20
Algorithm
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
x[candi, 1]
x[c
andi, 2
]
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 11 / 20
Algorithm
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
x[candi, 1]
x[c
andi, 2
]
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 12 / 20
Algorithm
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
x[candi, 1]
x[c
andi, 2
]
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 13 / 20
Algorithm
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
x[candi, 1]
x[c
andi, 2
]
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 14 / 20
A Toy Simulation
Setup: x1, x2iid∼ Unif[−1, 1], n = 1000, p = 2, k = 100, ε ∼ N(0, 1)
y = 1 + 2x1 + 2x2 + ε
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
Full data 1000x2
x_1
x_2
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
IBOSS method with k=100
x_1
x_2
−1.0 −0.5 0.0 0.5 1.0
−1.0
−0.5
0.0
0.5
1.0
OA−based method with k=100
x_1
x_2
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 15 / 20
Comparison
Unif_Sampling Leveraging IBOSS OA−based
0.0
00
.05
0.1
00
.15
0.2
0
MSEs for slope parameters (1000 repetitions)
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 16 / 20
Theoretical Properties
Assume that the covariates x1, . . . , xp are i.i.d. with a uniform
distribution in [a, b]p, and we are considering the model
y = β0 + β1x1 + · · ·+ βpxp + ε.
Theorem
The OA-based subsampling minimizes MSE, i.e., the sum of the
variances of coefficient estimations, almost surely as n→∞.
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 17 / 20
Model with Interactions
y = β0 + β1x1 + · · ·+ βpxp + β12x1x2 + · · ·+ β(p−1)pxp−1xp + ε
Define the J-optimality criterion as
J4(Xs) =∑
1≤i<j≤k[δ(xi , xj)]4 .
Assume that the covariates x1, . . . , xp are i.i.d. with a uniform
distribution in [a, b]p.
Theorem
The OA-based subsampling minimizes MSE, i.e., the sum of
variances of coefficient estimations, almost surely as n→∞.
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 18 / 20
Simulation
Setup: x1, . . . , xpiid∼ Unif[−1, 1], n = 104, p = 10, k = 128,
ε ∼ N(0, 32)
y = 1 + x1 + · · ·+ x10 + x1x2 + · · ·+ x9x10 + ε
Unif_Sampling Leveraging IBOSS OA−based
0.5
1.0
1.5
MSEs for slope parameters (100 repetitions)
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 19 / 20
Summary
Propose a new framework for subsampling: OA-based
subsampling
Computational complexity: O(np log(k))
Minimize J(Xs) for a linear model without interactions, attains
optimality for coefficient estimation for large n
Minimize J4(Xs) for a linear model with interactions, attains
optimality for coefficient estimation for large n
More efficient than leverage sampling and IBOSS
Hongquan Xu (UCLA) Orthogonal Array Based Big Data Subsampling 20 / 20