UCLA Department of Statistics - CARME 2011
carme2011.agrocampus-ouest.fr/slides/deleeuw.pdf
UCLA Department of Statistics
History and Theory of Nonlinear Principal Component Analysis
Jan de Leeuw
February 11, 2011
Jan de Leeuw
NLPCA History UCLA Department of Statistics
Abstract
Relationships between Multiple Correspondence Analysis (MCA) and Nonlinear Principal Component Analysis (NLPCA), which is defined as PCA with Optimal Scaling (OS), are discussed. We review the history of NLPCA.
We discuss forms of NLPCA that have been proposed over the years: Shepard-Kruskal-Breiman-Friedman-Gifi PCA with optimal scaling, Aspect Analysis of correlations, Guttman's MSA, Logit and Probit PCA of binary data, and Logistic Homogeneity Analysis.
Since I am trying to summarize 40+ years of work, the presentation will be rather dense.
Linear PCA: History
(Linear) Principal Components Analysis (PCA) is sometimes attributed to Hotelling (1933), but that is surely incorrect.
The equations for the principal axes of quadratic forms and surfaces, in various forms, were known from classical analytic geometry (notably from work by Cauchy and Jacobi in the mid 19th century).
There are some modest beginnings in Galton's Natural Inheritance of 1889, where the principal axes are connected for the first time with the "correlation ellipsoid".
There is a full-fledged (although tedious) discussion of the technique in Pearson (1901), and there is a complete application (7 physical traits of 3000 criminals) in MacDonell (1902), by a Pearson co-worker.
There is proper attribution in: Burt, C., Alternative Methods of Factor Analysis and their Relations to Pearson's Method of "Principle Axes", Br. J. Psych., Stat. Sec., 2 (1949), pp. 98-121.
Linear PCA: How To
Hotelling's introduction of PCA follows the now familiar route of making successive orthogonal linear combinations with maximum variance. He does this by using power iterations (without reference), discussed in 1929 by Von Mises and Pollaczek-Geiringer.
Pearson, following Galton, used the correlation ellipsoid throughout. This seems to me the more basic approach.
He cast the problem in terms of finding low-dimensional subspaces (lines and planes) of best (least squares) fit to a cloud of points, and connected the solution to the principal axes of the correlation ellipsoid.
In modern notation, this means minimizing SSQ(Y − XB′) over n × r matrices X and m × r matrices B. For r = 1 this is the best line, etc.
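Pearson's least squares problem can be sketched in a few lines of numpy. This is a minimal illustration (the helper name `best_fit_subspace` is mine, not from the slides): by the Eckart-Young theorem the minimizer of SSQ(Y − XB′) is the truncated SVD of the centered data.

```python
import numpy as np

def best_fit_subspace(Y, r=1):
    """Minimize SSQ(Y - X B') over n x r scores X and m x r loadings B.

    By Eckart-Young, the minimizer is the truncated SVD of Y.
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    X = U[:, :r] * s[:r]   # component scores
    B = Vt[:r].T           # loadings; columns span the best-fitting subspace
    return X, B

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 5))
Y -= Y.mean(axis=0)        # center: Pearson's cloud of points around its mean
X, B = best_fit_subspace(Y, r=2)
loss = np.sum((Y - X @ B.T) ** 2)
```

The residual `loss` equals the sum of the discarded squared singular values, which is why the principal axes of the correlation ellipsoid solve the best-fit problem.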
Correspondence Analysis: History
Simple Correspondence Analysis (CA) of a bivariate frequency table was first discussed, in fairly rudimentary form, by Pearson (1905), by looking at transformations linearizing regressions. See De Leeuw, On the Prehistory of Correspondence Analysis, Statistica Neerlandica, 37, 1983, 161-164.
This was taken up by Hirschfeld (Hartley) in 1935, where the technique was presented in a fairly complete form (to maximize correlation and decompose contingency). This approach was later adopted by Gebelein, and by Rényi and his students in their study of maximal correlation.
Correspondence Analysis: History
In the 1938 edition of Statistical Methods for Research Workers Fisher scores a categorical variable to maximize a ratio of variances (quadratic forms). This is not quite CA, because it is presented in an (asymmetric) regression context.
Symmetric CA and the reciprocal averaging algorithm are discussed, however, in Fisher (1940) and applied by his co-worker Maung (1941a,b).
In the early sixties the chi-square metric, relating CA to metric multidimensional scaling (MDS), with an emphasis on geometry and plotting, was introduced by Benzécri (thesis of Cordier, 1965).
Multiple Correspondence Analysis: History
Different weighting schemes to combine quantitative variables into an index that optimizes some variance-based discrimination or homogeneity criterion were proposed in the late thirties by Horst (1936), by Edgerton and Kolbe (1936), and by Wilks (1938).
The same idea was applied to qualitative variables in a seminal paper by Guttman (1941), which presents, for the first time, the equations defining Multiple Correspondence Analysis (MCA).
The equations are presented in the form of a row-eigen (scores), a column-eigen (weights), and a singular value (joint) problem.
The paper introduces the "codage disjonctif complet" as well as the "Tableau de Burt", and points out the connections with the chi-square metric.
There is no geometry, and the emphasis is on constructing a single scale. In fact Guttman warns against extracting and using additional eigen-pairs.
Multiple Correspondence Analysis: Further History
In Guttman (1946) scale or index construction was extended to paired comparisons and ranks. In Guttman (1950) it was extended to scalable binary items.
In the fifties and sixties Hayashi introduced the quantification techniques of Guttman in Japan, where they were widely disseminated through the work of Nishisato. Various extensions and variations were added by the Japanese school.
Starting in 1968, MCA was studied as a form of metric MDS by De Leeuw.
Although the equations defining MCA were the same as those defining PCA, the relationship between the two remained problematic.
These problems are compounded by "horse shoes" or the "effect Guttman", i.e. artificial curvilinear relationships between successive dimensions (eigenvectors).
Nonlinear PCA: What?
PCA can be made non-linear in various ways.
1. First, we could seek indices which discriminate maximally and are non-linear combinations of the variables. This generalizes the weighting approach (Hotelling).
2. Second, we could find nonlinear combinations of components that are close to the observed variables. This generalizes the reduced rank approach (Pearson).
3. Third, we could look for transformations of the variables that optimize the linear PCA fit. This is known (term of Darrell Bock) as the optimal scaling (OS) approach.
Nonlinear PCA: Forms
The first approach has not been studied much, although there are some relations with Item Response Theory.
The second approach is currently popular in Computer Science, as "nonlinear dimension reduction". I am currently working on a polynomial version, but there is no unified theory, and the papers are usually of the "well, we could also do this" type familiar from cluster analysis.
The third approach preserves many of the properties of linear PCA and can be connected with MCA as well. We shall follow its history and discuss the main results.
Nonlinear PCA: PCA with OS
Guttman observed in 1959 that if we require that the regressions between monotonically transformed variables are linear, then the transformations are uniquely defined. In general, however, we need approximations.
The loss function for PCA-OS is SSQ(Y − XB′), as before, but now we minimize over components X, loadings B, and transformations Y.
Transformations are defined column-wise (over variables) and belong to some restricted class (monotone, step, polynomial, spline).
Algorithms often are of the alternating least squares type, where optimal transformation and low-rank matrix approximation are alternated until convergence.
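The alternating scheme for monotone transformations can be sketched in numpy. This is an illustrative sketch, not any of the historical programs: `pava` is a bare-bones pool-adjacent-violators monotone regression, and `pca_os` alternates a rank-r SVD fit with a monotone rescoring of each column (function names and normalization choices are mine).

```python
import numpy as np

def pava(y):
    """Pool-adjacent-violators: least squares monotone increasing fit to y."""
    blocks = []                        # [mean, size] per monotone block
    for v in np.asarray(y, dtype=float):
        blocks.append([v, 1])
        while len(blocks) > 1 and blocks[-2][0] >= blocks[-1][0]:
            m2, s2 = blocks.pop()
            m1, s1 = blocks.pop()
            blocks.append([(m1 * s1 + m2 * s2) / (s1 + s2), s1 + s2])
    return np.concatenate([[m] * s for m, s in blocks])

def pca_os(Y, r=1, iters=50):
    """Alternate a rank-r least squares fit and monotone rescoring per column."""
    Z = (Y - Y.mean(axis=0)) / Y.std(axis=0)     # start from standardized data
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(Z, full_matrices=False)
        Zhat = (U[:, :r] * s[:r]) @ Vt[:r]       # current rank-r approximation
        for j in range(Z.shape[1]):
            order = np.argsort(Y[:, j])          # monotone in the raw variable
            t = np.empty(Z.shape[0])
            t[order] = pava(Zhat[order, j])      # optimal monotone transform
            t = (t - t.mean()) / max(t.std(), 1e-12)   # renormalize the column
            Z[:, j] = t
    return Z
```

Each transformed column is monotone in the corresponding raw variable, and the low-rank approximation and the transformations are improved in turn, as in the ALS programs described below.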
PCA-OS: History of Programs
Shepard and Kruskal used the monotone regression machinery of non-metric MDS to construct the first PCA-OS programs around 1962. The paper describing the technique was not published until 1975.
Around 1970 versions of PCA-OS (sometimes based on Guttman's rank image principle) were developed by Lingoes and Roskam.
In 1973 De Leeuw, Young, and Takane started the ALSOS project, which resulted in PRINCIPALS (published in 1978), and PRINQUAL in SAS.
In 1980 De Leeuw (with Heiser, Meulman, Van Rijckevorsel, and many others) started the Gifi project, which resulted in PRINCALS, in SPSS CATPCA, and in the R package homals by De Leeuw and Mair (2009).
In 1983 Winsberg and Ramsay published a PCA-OS version using monotone spline transformations.
In 1987 Koyak, using the ACE smoothing methodology of Breiman and Friedman (1985), introduced mdrace.
PCA/MCA: The Gifi Project
The Gifi project followed the ALSOS project. It has or had as its explicit goals:
1. Unify a large class of multivariate analysis methods by combining a single loss function, parameter constraints (as in MDS), and ALS algorithms.
2. Give a very general definition of component analysis (to be called homogeneity analysis) that would cover CA, MCA, linear PCA, nonlinear PCA, regression, discriminant analysis, and canonical analysis.
3. Write code and analyze examples for homogeneity analysis.
Gifi: Loss of Homogeneity
The basic Gifi loss function is

σ(X, Y) = ∑(j=1..m) SSQ(X − GjYj).

The n × kj matrices Gj are the data, coded as indicator matrices (or dummies). Alternatively, Gj can be a B-spline basis. Also, Gj can have zero rows for missing data.
X is an n × p matrix of object scores, satisfying the normalization conditions X′X = I.
The Yj are kj × p matrices of category quantifications. There can be rank, level, and additivity constraints on the Yj.
Gifi: ALS
The basic Gifi algorithm alternates

1. X(k) = ORTH(∑(j=1..m) GjYj(k)).
2. Yj(k+1) = argmin over Yj ∈ 𝒴j of tr (Ŷj(k+1) − Yj)′Dj(Ŷj(k+1) − Yj).

We use the following notation.
Superscript (k) is used for iterations.
ORTH() is any orthogonalization method such as QR, Gram-Schmidt, or SVD.
Dj = Gj′Gj are the marginals.
Ŷj(k+1) = Dj⁻¹Gj′X(k) are the category centroids.
The constraints on Yj are written as Yj ∈ 𝒴j.
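For the unconstrained case (𝒴j free, which is MCA), the two ALS steps can be sketched directly in numpy. This is a minimal sketch, not the homals package: `indicator` and `homals` are hypothetical names, ORTH() is taken to be QR, and the centroid step uses Dj = Gj′Gj exactly as above.

```python
import numpy as np

def indicator(v):
    """Code a categorical vector as an n x k_j indicator matrix G_j."""
    cats = np.unique(v)
    return (np.asarray(v)[:, None] == cats[None, :]).astype(float)

def homals(data, p=2, iters=100, seed=0):
    """Unconstrained homogeneity analysis (= MCA) via the Gifi ALS scheme."""
    G = [indicator(col) for col in data.T]
    n = data.shape[0]
    rng = np.random.default_rng(seed)
    start = rng.normal(size=(n, p))
    start -= start.mean(axis=0)              # centered random start
    X = np.linalg.qr(start)[0]               # so that X'X = I
    for _ in range(iters):
        # centroid step: Yhat_j = D_j^{-1} G_j' X with D_j = G_j' G_j
        Y = [np.linalg.solve(g.T @ g, g.T @ X) for g in G]
        # object score step: orthogonalize the sum of the G_j Y_j
        S = sum(g @ y for g, y in zip(G, Y))
        S -= S.mean(axis=0)                  # keep object scores centered
        X = np.linalg.qr(S)[0]               # ORTH() step, here via QR
    Y = [np.linalg.solve(g.T @ g, g.T @ X) for g in G]
    loss = sum(np.sum((X - g @ y) ** 2) for g, y in zip(G, Y))
    return X, Y, loss
```

With rank-one (single) constraints added to the Yj update, the same loop would give NLPCA instead of MCA.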
Gifi: Star Plots
Let’s look at some movies.
GALO: 1290 students, 4 variables. We show both MCA and NLPCA.
Senate: 100 senators, 20 votes. Since the variables are binary, MCA = NLPCA.
objplot galo

[Object score plot of the 1290 GALO students; the points are labeled 1-1290.]
objplot senate

[Object score plot of the 100 senators, with points labeled by senator name.]
Gifi: Single Variables
If there are no constraints on the Yj, homogeneity analysis is MCA.
We will not go into additivity constraints, because they take us from PCA and MCA towards regression and canonical analysis. See the homals paper and package.
A single variable has constraints Yj = zjaj′, i.e. the category quantifications are of rank one. In a given analysis some variables can be single while others can be multiple (unconstrained). More generally, there can be rank constraints on the Yj.
This can be combined with level constraints on the single quantifications zj, which can be numerical, polynomial, ordinal, or nominal.
If all variables are single, homogeneity analysis is NLPCA (i.e. PCA-OS). This relationship follows from the form of the loss function.
Gifi: Multiple Variables
There is another relationship, which is already implicit in Guttman (1941). If we transform the variables to maximize the dominant eigenvalue of the correlation matrix, then we find both the first MCA dimension and the one-dimensional nominal PCA solution.
But there are deeper relations between MCA and NLPCA. These were developed in a series of papers by De Leeuw and co-workers, starting in 1980. Their analysis also elucidates the "Effect Guttman".
These relationships are most easily illustrated by performing an MCA of a continuous standardized multivariate normal, say on m variables, analyzed in the form of a Burt Table (with doubly-infinite subtables). Suppose the correlation matrix of this distribution is the m × m matrix R = {rjℓ}.
Gifi: Multinormal MCA
Suppose I = R[0], R = R[1], R[2], ... is the infinite sequence of Hadamard (elementwise) powers of R.
Suppose λj[s] are the m eigenvalues of R[s] and yj[s] are the corresponding eigenvectors.
The eigenvalues of the MCA solution are the m × ∞ eigenvalues λj[s].
The MCA eigenvector corresponding to λj[s] consists of the m functions yjℓ[s]Hs, with Hs the s-th normalized Hermite polynomial.
An MCA eigenvector thus consists of m linear transformations, or m quadratic transformations, and so on. There are m linear eigenvectors, m quadratic eigenvectors, and so on.
Gifi: Multinormal MCA
The same theory applies to what Yule (1900) calls "strained multinormal" variables zj, for which there exist diffeomorphisms φj such that the φj(zj) are jointly multinormal (Gaussian copulas are an example).
And the same theory also applies, except for the polynomial part, when separate transformations of the variables exist that linearize all bivariate regressions (this generalizes a result of Pearson from 1905).
Under all these scenarios, MCA solutions are NLPCA solutions, and vice versa.
With the provision that NLPCA solutions are always selected from the same R[s], while MCA solutions come from all R[s].
Also, generally, the dominant eigenvalue is λ1[1] and the second largest one is either λ1[2] or λ2[1]. In the first case we have a horseshoe.
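The eigenvalue ordering is easy to check numerically. As an illustration under an assumed equicorrelation structure (m = 4 variables, common correlation r = 0.7, my choice of example), we compute the eigenvalues of the first few Hadamard powers of R and test the horseshoe condition:

```python
import numpy as np

m, r = 4, 0.7
R = (1 - r) * np.eye(m) + r * np.ones((m, m))   # equicorrelated multinormal

# Eigenvalues of the Hadamard (elementwise) powers R^[s]; together these
# form the MCA spectrum of the multinormal, sorted within each power.
lam = {s: np.sort(np.linalg.eigvalsh(R ** s))[::-1] for s in (1, 2, 3)}

# Horseshoe check: is the largest quadratic eigenvalue lambda_1[2] bigger
# than the second linear eigenvalue lambda_2[1]?
horseshoe = lam[2][0] > lam[1][1]
```

For this R the linear eigenvalues are 1 + 3r = 3.1 and 1 − r = 0.3 (three times), while the largest eigenvalue of R^[2] is 1 + 3r² = 2.47, so the second MCA dimension is quadratic and a horseshoe appears.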
Gifi: Bilinearizability
"Joint bilinearizability" also occurs (trivially) if m = 2, i.e. in CA, and if kj = 2 for all j, i.e. for binary variables.
If there is joint bilinearizability then the joint first-order asymptotic normal distribution of the induced correlation coefficients does not depend on the standard errors of the computed optimal transformations (no matter if they come from MCA or NLPCA or any other OS method).
There is additional horseshoe theory, due mostly to Schriever (1986), that uses the Krein-Gantmacher-Karlin theory of total positivity. It is not based on families of orthogonal polynomials, but on (higher-order) order relations.
This was, once again, anticipated by Guttman (1950), who used finite difference equations to derive the horseshoe MCA/NLPCA for the binary items defining a perfect scale.
More: Pavings
If we have a mapping of n objects into R^p, then a categorical variable can be used to label the objects.
The subsets corresponding to the categories of the variable are supposed to be homogeneous.
This can be formalized either as being small or as being separated by lines or curves. There are many ways to quantify this in loss functions.
MCA (multiple variables, star plots) tends to think small (within vs between); NLPCA tends to think separable.
Guttman's MSA defines outer and inner points of a category. An outer point is a closest point for any point not in the category.
The closest outer point for an inner point should belong to the same category as the inner point. This is a nice "topological" way to define separation, but it is hard to quantify.
More: Aspects
The aspect approach (De Leeuw and Mair, JSS, 2010, using theory from De Leeuw, 1990) goes back to Guttman's original 1941 motivation.
An aspect is any function of all the correlations (and/or the correlation ratios) between the m transformed variables.
Now choose the transformations/quantifications such that the aspect is maximized. We use majorization to turn this into a sequence of least squares problems.
For MCA the aspect is the largest eigenvalue; for NLPCA it is the sum of the largest p eigenvalues.
Determinants, multiple correlations, and canonical correlations can also be used as aspects.
Or: the sum of the differences of the correlation ratios and the squared correlation coefficients.
Multinormal, strained multinormal, and bilinearizability theory applies to all aspects.
More: Logistic
Instead of using least squares throughout, we can build a similar system using logit or probit log-likelihoods. This is in the development stages.
The basic loss function, corresponding to the Gifi loss function, is

∑(i=1..n) ∑(j=1..m) ∑(ℓ=1..kj) gijℓ log [ exp{−φ(xi, yjℓ)} / ∑(ν=1..kj) exp{−φ(xi, yjν)} ]

where the data are indicators, as before. The function φ can be distance, squared distance, or negative inner product.
This emphasizes separation, because we want the xi closest to the yjℓ for which gijℓ = 1.
We use majorization to turn this into a sequence of reweighted least squares MDS or PCA problems.
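The loss above is just a multinomial log-likelihood and is straightforward to evaluate. As a minimal sketch (function name mine, φ taken to be squared Euclidean distance, one of the three choices mentioned above), returning the negated log-likelihood so that smaller is better:

```python
import numpy as np

def logistic_homogeneity_loss(X, Ys, G):
    """Negative multinomial log-likelihood with phi = squared distance.

    X  : n x p object scores
    Ys : list of k_j x p category points y_jl, one entry per variable
    G  : list of n x k_j indicator matrices with entries g_ijl
    """
    total = 0.0
    for g, y in zip(G, Ys):
        # phi(x_i, y_jl) = squared Euclidean distance between the points
        d2 = ((X[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
        logits = -d2
        logZ = np.logaddexp.reduce(logits, axis=1)     # stable log-sum-exp
        total -= np.sum(g * (logits - logZ[:, None]))  # minus sum g log p
    return total
```

The loss is near zero when every object score sits on its own category point and far from the others, which is exactly the separation the slide describes.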