Post on 10-Mar-2018
PCAmixPCArot
MFAmix
Multivariate analysis of mixed data: ThePCAmixdata R package
Marie ChaventIn collaboration with: Vanessa Kuentz, Amaury Labenne, Benoıt
Liquet, Jerome Saracco
University of Bordeaux, FranceInria Bordeaux Sud-Ouest, CQFD Team
Irstea, UR ADBX, cestas, FranceThe University of Queensland, Australia
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
Multivariate analysis of a mixture of numerical andcategorical data
Three main functions:
Function PCAmix for principal component analysis (PCA) ofmixed data.↪→ Includes standard PCA and MCA (multiple componentanalysis) as special cases.
Function PCArot for orthogonal rotation in PCAmix.↪→ Includes standard varimax rotation and rotation in MCA asspecial cases.
Function MFAmix for multiple factor analysis (MFA) formultiple-table mixed data.
https://github.com/chavent/PCAmixdata
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Outline
1 PCAmix
2 PCArot
3 MFAmix
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Principal component analysis of mixed data
Several implementations already in R:
Function FAMD in the R package FactoMineR.
↪→ Implements the method designed by Pages (2004).
Function dudi.mix in the R package ade4.
↪→ Implements the method of Hill & Smith (1976).
Function PCAmix in the R package PCAmixdata.↪→ Implements a single PCA with metrics based on a GSVDof preprocessed data.
⇒ Three different coding scheme but identical numerical results.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
A real data example
The gironde data are available in the package
library(PCAmixdata)
data(gironde)
They characterize living conditions in Gironde, a southwestregion in France.
542 cities are described with 27 variables separated into 4groups (Employment, Housing, Services, Environment).
↪→ Four datatables
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
A mixed data type example
The datatable housing is mixed.
↪→ 3 numerical and 2 categorical variables.
housing <- gironde$housing
head(housing)
## density primaryres owners houses council
## ABZAC 132 89 64 inf 90% sup 5%
## AILLAS 21 88 77 sup 90% inf 5%
## AMBARES 532 95 66 inf 90% sup 5%
## AMBES 101 94 67 sup 90% sup 5%
## ANDERNOS 552 62 72 inf 90% inf 5%
## ANGLADE 64 81 81 sup 90% inf 5%
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Two data sets:
↪→ a numerical data matrix X1 of dimension 542× 3.
↪→ a categorical data matrix X2 of dimension 542× 2.
split<-splitmix(housing)
X1<-split$X.quanti
X2<-split$X.quali
head(X1)
## density primaryres owners
## ABZAC 132 89 64
## AILLAS 21 88 77
## AMBARES 532 95 66
## AMBES 101 94 67
## ANDERNOS 552 62 72
## ANGLADE 64 81 81
head(X2)
## houses council
## ABZAC inf 90% sup 5%
## AILLAS sup 90% inf 5%
## AMBARES inf 90% sup 5%
## AMBES sup 90% sup 5%
## ANDERNOS inf 90% inf 5%
## ANGLADE sup 90% inf 5%
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
The PCAmix method
An simple algorithm in three main steps
1 Preprocessing step.
2 GSVD (Generalized Singular Value Decomposition) step.
3 Scores processing step.
Some notations:
- Let X1 be a n × p1 numerical data matrix.
- Let X2 be a n × p2 categorical data matrix.
- Let m be the total number of categories.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Preprocessing step
1 Build a numerical data matrix Z = (Z1|Z2) of dimensionn × (p1 + m) with:
↪→ Z1 the standardized version of the matrix X1.↪→ Z2 the centered indicator matrix of the levels of X2.
2 Build the diagonal matrix N of the weights of the rows.
↪→ The n rows are weighted by 1n .
3 Build the diagonal matrix M of the weights of the columns.
↪→ The p1 first columns are weighted by 1.↪→ The m last columns are weighted by n
ns, with ns the number of
observations with level s.
↪→ The total variance is p1 + m − p2.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
GSVD step
The GSVD (Generalized Singular Value Decomposition) of Z withthe metrics N and M gives the decompositon
Z = UDVt (1)
where
- D = diag(√λ1, . . . ,
√λr ) is the r × r diagonal matrix of the
singular values of ZMZtN and ZtNZM, and r denotes therank of Z;
- U is the n × r matrix of the first r eigenvectors of ZMZtNsuch that UtNU = Ir ;
- V is the p × r matrix of the first r eigenvectors of ZtNZMsuch that VtMV = Ir .
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Scores processing step
1 The set of factor scores for rows is computed as:
F = UD.
↪→ Principal Component scores
2 The set of factor scores for colums is computed as:
A = MVD.
↪→ In standard PCA: A = VD.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
The R function
args(PCAmix)
## function (X.quanti = NULL, X.quali = NULL, ndim = 5, rename.level = FALSE,
## weight.col = NULL, weight.row = NULL, graph = TRUE)
## NULL
PCAmix(X.quanti=X1,X.quali=X2,ndim=2)
##
## Call:
## PCAmix(X.quanti = X1, X.quali = X2, ndim = 2)
##
## Method = Principal Component of mixed data (PCAmix)
##
##
## name description
## [1,] "$eig" "eigenvalues of the principal components (PC) "
## [2,] "$ind" "results for the individuals (coord,contrib,cos2)"
## [3,] "$quanti" "results for the quantitative variables (coord,contrib,cos2)"
## [4,] "$levels" "results for the levels of the qualitative variables (coord,contrib,cos2)"
## [5,] "$quali" "results for the qualitative variables (contrib,relative contrib)"
## [6,] "$sqload" "squared loadings"
## [7,] "$coef" "coef of the linear combinations defining the PC"
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Principal Component scores of the observations
obj <- PCAmix(X.quanti=X1,X.quali=X2,ndim=2)
head(obj$ind$coord)
## dim 1 dim 2
## ABZAC 2.36 0.024
## AILLAS -0.88 0.123
## AMBARES 2.62 0.800
## AMBES 0.93 0.919
## ANDERNOS 1.18 -2.481
## ANGLADE -1.01 -0.424
plot(obj,choice="ind")
−2 0 2 4 6 8 10
−6
−4
−2
02
Individuals component map
Dim 1 (50.54 %)
Dim
2 (
21.3
9 %
)ABZACAILLAS
AMBARESAMBES
ANDERNOS
ANGLADE
ARBANATSARBIS
ARCACHON
ARCINS
ARES
ARSAC ARTIGUES−PRES−BORDEAUXARTIGUES−DE−LUSSAC
ARVEYRESASQUES
AUBIACAUBIE−ET−ESPESSAS
AUDENGEAURIOLLES
AUROS
AVENSANAYGUEMORTE−LES−GRAVES
BAGASBAIGNEAUXBALIZACBARIE
BARONBARP
BARSACBASSANNE
BASSENSBAURECHBAYAS
BAYON−SUR−GIRONDEBAZAS
BEAUTIRAN
BEGADAN
BEGLES
BEGUEY
BELIN−BELIETBELLEBAT
BELLEFONDBELVES−DE−CASTILLONBERNOS−BEAULAC
BERSONBERTHEZ
BEYCHAC−ET−CAILLAUBIEUJAC BIGANOS
BILLAUXBIRACBLAIGNAC
BLAIGNAN
BLANQUEFORT
BLASIMON
BLAYE
BLESIGNAC
BOMMES
BONNETANBONZAC
BORDEAUX
BOSSUGAN
BOULIAC
BOURDELLES BOURG
BOURIDEYS
BOUSCAT
BRACH BRANNEBRANNENS
BRAUD−ET−SAINT−LOUIS
BROUQUEYRAN
BRUGES
BUDOS
CABANAC−ET−VILLAGRAINS
CABARA
CADARSAC CADAUJAC
CADILLAC
CADILLAC−EN−FRONSADAISCAMARSACCAMBES
CAMBLANES−ET−MEYNAC
CAMIAC−ET−SAINT−DENISCAMIRANCAMPS−SUR−L'ISLE
CAMPUGNAN
CANEJANCANTENAC
CANTOISCAPIAN
CAPLONG
CAPTIEUX
CARBON−BLANC
CARCANS
CARDAN
CARIGNAN−DE−BORDEAUX
CARS
CARTELEGUE
CASSEUIL
CASTELMORON−D'ALBRET
CASTELNAU−DE−MEDOCCASTELVIELCASTETS−EN−DORTHECASTILLON−DE−CASTETS
CASTILLON−LA−BATAILLE
CASTRES−GIRONDE
CAUDROTCAUMONT
CAUVIGNAC
CAVIGNAC
CAZALIS
CAZATS
CAZAUGITAT
CENAC
CENON
CERONSCESSAC
CESTAS
CEZAC
CHAMADELLECISSAC−MEDOC
CIVRAC−DE−BLAYE
CIVRAC−SUR−DORDOGNE
CIVRAC−EN−MEDOC
CLEYRAC
COIMERES
COIRAC
COMPS
COUBEYRAC
COUQUEQUES
COURPIAC
COURS−DE−MONSEGURCOURS−LES−BAINS
COUTRASCOUTURES
CREONCROIGNONCUBNEZAISCUBZAC−LES−PONTS
CUDOS
CURSAN
CUSSAC−FORT−MEDOCDAIGNAC
DARDENACDAUBEZE
DIEULIVOL
DONNEZACDONZACDOULEZONEGLISOTTES−ET−CHALAURES
ESCAUDES
ESCOUSSANSESPIETESSEINTES
ETAULIERS
EYNESSE
EYRANS
EYSINES
FALEYRASFARGUESFARGUES−SAINT−HILAIREFIEU
FLOIRAC
FLAUJAGUESFLOUDES
FONTETFOSSES−ET−BALEYSSACFOURS
FRANCS
FRONSACFRONTENACGABARNACGAILLAN−EN−MEDOCGAJACGALGONGANS
GARDEGAN−ET−TOURTIRACGAURIAC
GAURIAGUET
GENERAC
GENISSACGENSACGIRONDE−SUR−DROPTGISCOS
GORNAC
GOUALADE
GOURS
GRADIGNAN
GRAYAN−ET−L'HOPITAL
GREZILLACGRIGNOLS
GUILLACGUILLOS
GUITRES
GUJAN−MESTRAS
HAILLAN
HAUX
HOSTENS
HOURTIN
HUREILLATSISLE−SAINT−GEORGES
IZON
JAU−DIGNAC−ET−LOIRAC
JUGAZAN
JUILLAC
LABARDE
LABESCAU
BREDE
LACANAU
LADAUXLADOSLAGORCE
LANDE−DE−FRONSAC
LAMARQUELAMOTHE−LANDERRONLALANDE−DE−POMEROL
LANDERROUATLANDERROUET−SUR−SEGUR
LANDIRAS
LANGOIRAN LANGON
LANSAC
LANTON
LAPOUYADE
LAROQUE
LARTIGUE
LARUSCADELATRESNELAVAZAN
LEGE−CAP−FERRET
LEOGEATSLEOGNAN
LERM−ET−MUSSETLESPARRE−MEDOC
LESTIAC−SUR−GARONNE
LEVES−ET−THOUMEYRAGUES
LIBOURNELIGNAN−DE−BAZASLIGNAN−DE−BORDEAUXLIGUEUX
LISTRAC−DE−DUREZELISTRAC−MEDOC
LORMONT
LOUBENSLOUCHATS
LOUPES
LOUPIAC
LOUPIAC−DE−LA−REOLE
LUCMAU
LUDON−MEDOC
LUGAIGNAC
LUGASSON
LUGON−ET−L'ILE−DU−CARNAY
LUGOSLUSSAC
MACAUMADIRACMARANSINMARCENAIS
MARCILLAC MARGAUXMARGUERON
MARIMBAULT
MARIONS
MARSAS MARTIGNAS−SUR−JALLEMARTILLAC
MARTRES
MASSEILLESMASSUGAS
MAURIAC
MAZERES
MAZION
MERIGNAC
MERIGNAS
MESTERRIEUX
MIOSMOMBRIERMONGAUZY
MONPRIMBLANCMONSEGUR
MONTAGNE
MONTAGOUDINMONTIGNAC
MONTUSSAN
MORIZESMOUILLAC
MOULIETS−ET−VILLEMARTINMOULIS−EN−MEDOCMOULONMOURENS
NAUJAC−SUR−MER
NAUJAN−ET−POSTIAC
NEAC
NERIGEAN
NEUFFONSNIZAN
NOAILLAC
NOAILLAN
OMET
ORDONNAC
ORIGNE PAILLET
PAREMPUYRE
PAUILLAC
PEINTURES
PELLEGRUE
PERISSACPESSAC
PESSAC−SUR−DORDOGNE
PETIT−PALAIS−ET−CORNEMPS
PEUJARDPIAN−MEDOC
PIAN−SUR−GARONNE
PINEUILHPLASSAC
PLEINE−SELVE
PODENSAC
POMEROL
POMPEJAC
POMPIGNAC
PONDAURATPORCHERES
PORGE
PORTETSPOUT
PRECHAC
PREIGNACPRIGNAC−EN−MEDOCPRIGNAC−ET−MARCAMPS PUGNAC
PUISSEGUIN
PUJOLS−SUR−CIRONPUJOLS
PUY
PUYBARBANPUYNORMAND
QUEYRAC
QUINSAC
RAUZANREIGNAC
REOLERIMONSRIOCAUD
RIONS
RIVIERE
ROAILLAN
ROMAGNEROQUEBRUNEROQUILLERUCH
SABLONSSADIRAC
SAILLANSSAINT−AIGNAN
SAINT−ANDRE−DE−CUBZACSAINT−ANDRE−DU−BOISSAINT−ANDRE−ET−APPELLES
SAINT−ANDRONY
SAINT−ANTOINE
SAINT−ANTOINE−DU−QUEYRETSAINT−ANTOINE−SUR−L'ISLE
SAINT−AUBIN−DE−BLAYESAINT−AUBIN−DE−BRANNE
SAINT−AUBIN−DE−MEDOC
SAINT−AVIT−DE−SOULEGE
SAINT−AVIT−SAINT−NAZAIRESAINT−BRICESAINT−CAPRAIS−DE−BLAYE
SAINT−CAPRAIS−DE−BORDEAUX
SAINT−CHRISTOLY−DE−BLAYE
SAINT−CHRISTOLY−MEDOC
SAINT−CHRISTOPHE−DES−BARDES
SAINT−CHRISTOPHE−DE−DOUBLE
SAINT−CIBARDSAINT−CIERS−D'ABZAC
SAINT−CIERS−DE−CANESSE
SAINT−CIERS−SUR−GIRONDE
SAINTE−COLOMBE
SAINT−COME
SAINTE−CROIX−DU−MONTSAINT−DENIS−DE−PILE
SAINT−EMILIONSAINT−ESTEPHE
SAINT−ETIENNE−DE−LISSE
SAINTE−EULALIESAINT−EXUPERYSAINT−FELIX−DE−FONCAUDE
SAINT−FERME
SAINTE−FLORENCE
SAINTE−FOY−LA−GRANDESAINTE−FOY−LA−LONGUE
SAINTE−GEMME
SAINT−GENES−DE−BLAYESAINT−GENES−DE−CASTILLON
SAINT−GENES−DE−FRONSACSAINT−GENES−DE−LOMBAUDSAINT−GENIS−DU−BOIS
SAINT−GERMAIN−DE−GRAVESAINT−GERMAIN−D'ESTEUIL
SAINT−GERMAIN−DU−PUCHSAINT−GERMAIN−DE−LA−RIVIERESAINT−GERVAISSAINT−GIRONS−D'AIGUEVIVES
SAINTE−HELENE
SAINT−HILAIRE−DE−LA−NOAILLESAINT−HILAIRE−DU−BOIS
SAINT−HIPPOLYTE
SAINT−JEAN−DE−BLAIGNAC
SAINT−JEAN−D'ILLAC
SAINT−JULIEN−BEYCHEVELLE
SAINT−LAURENT−MEDOC
SAINT−LAURENT−D'ARCE
SAINT−LAURENT−DES−COMBES
SAINT−LAURENT−DU−BOISSAINT−LAURENT−DU−PLAN
SAINT−LEGER−DE−BALSONSAINT−LEONSAINT−LOUBERT
SAINT−LOUBES
SAINT−LOUIS−DE−MONTFERRAND
SAINT−MACAIRE
SAINT−MAGNESAINT−MAGNE−DE−CASTILLONSAINT−MAIXANTSAINT−MARIENSSAINT−MARTIAL
SAINT−MARTIN−LACAUSSADE
SAINT−MARTIN−DE−LAYESAINT−MARTIN−DE−LERM
SAINT−MARTIN−DE−SESCASSAINT−MARTIN−DU−BOIS
SAINT−MARTIN−DU−PUYSAINT−MEDARD−DE−GUIZIERES
SAINT−MEDARD−D'EYRANSSAINT−MEDARD−EN−JALLES
SAINT−MICHEL−DE−CASTELNAU
SAINT−MICHEL−DE−FRONSAC
SAINT−MICHEL−DE−RIEUFRET
SAINT−MICHEL−DE−LAPUJADE
SAINT−MORILLON
SAINT−PALAIS
SAINT−PARDON−DE−CONQUES
SAINT−PAUL
SAINT−PEY−D'ARMENS
SAINT−PEY−DE−CASTETSSAINT−PHILIPPE−D'AIGUILLE
SAINT−PHILIPPE−DU−SEIGNALSAINT−PIERRE−D'AURILLAC
SAINT−PIERRE−DE−BAT
SAINT−PIERRE−DE−MONS
SAINT−QUENTIN−DE−BARON
SAINT−QUENTIN−DE−CAPLONG
SAINTE−RADEGONDESAINT−ROMAIN−LA−VIRVEESAINT−SAUVEUR
SAINT−SAUVEUR−DE−PUYNORMAND
SAINT−SAVIN
SAINT−SELVE
SAINT−SEURIN−DE−BOURG
SAINT−SEURIN−DE−CADOURNE
SAINT−SEURIN−DE−CURSAC
SAINT−SEURIN−SUR−L'ISLE
SAINT−SEVESAINT−SULPICE−DE−FALEYRENS
SAINT−SULPICE−DE−GUILLERAGUESSAINT−SULPICE−DE−POMMIERS
SAINT−SULPICE−ET−CAMEYRAC
SAINT−SYMPHORIENSAINTE−TERRESAINT−TROJAN
SAINT−VINCENT−DE−PAUL
SAINT−VINCENT−DE−PERTIGNASSAINT−VIVIEN−DE−BLAYE
SAINT−VIVIEN−DE−MEDOC
SAINT−VIVIEN−DE−MONSEGUR SAINT−YZAN−DE−SOUDIAC
SAINT−YZANS−DE−MEDOC
SALAUNESSALIGNAC
SALLEBOEUF
SALLESSALLES−DE−CASTILLON
SAMONACSAUCATS
SAUGON
SAUMOS
SAUTERNES
SAUVE
SAUVETERRE−DE−GUYENNE
SAUVIACSAVIGNAC
SAVIGNAC−DE−L'ISLE
SEMENS
SENDETS
SIGALENS
SILLAS
SOULAC−SUR−MER
SOULIGNAC
SOUSSAC
SOUSSANSTABANAC
TAILLAN−MEDOC
TAILLECAVAT
TALAIS
TALENCE
TARGON
TARNES
TAURIAC
TAYAC TEICH
TEMPLE
TESTE−DE−BUCH
TEUILLACTIZAC−DE−CURTONTIZAC−DE−LAPOUYADE TOULENNETOURNE
TRESSES
TUZANUZESTE
VALEYRAC
VAYRES
VENDAYS−MONTALIVET
VENSAC
VERAC
VERDELAIS
VERDON−SUR−MER
VERTHEUILVIGNONET
VILLANDRAUT
VILLEGOUGEVILLENAVE−DE−RIONS
VILLENAVE−D'ORNON
VILLENEUVE
VIRELADEVIRSAC YVRACMARCHEPRIME
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Loadings of the numerical variables
head(obj$quanti.cor)
## dim1 dim2
## density 0.704 0.25
## primaryres -0.019 0.97
## owners -0.858 0.13
#Component map with factor scores of the numerical columns
plot(obj,choice="cor")
−2 −1 0 1 2
−1.
0−
0.5
0.0
0.5
1.0
Correlation circle
Dim 1 (50.54 %)
Dim
2 (
21.3
9 %
) density
primaryres
owners
↪→ The (non standardized) loadings are correlations.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Scores of the levels of the categorical variables
head(obj$categ.coord)
## dim1 dim2
## inf 90% 1.63 -0.339
## sup 90% -0.42 0.087
## inf 5% -0.40 -0.065
## sup 5% 1.52 0.245
plot(obj,choice="levels")
−0.5 0.0 0.5 1.0 1.5 2.0
−0.
4−
0.3
−0.
2−
0.1
0.0
0.1
0.2
0.3
Levels component map
Dim 1 (50.54 %)
Dim
2 (
21.3
9 %
)
inf 90%
sup 90%
inf 5%
sup 5%
↪→ The barycentric property is still verified.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Contributions of the variables
#contributions of the variables
head(obj$sqload)
## dim1 dim2
## density 0.49550 0.061
## primaryres 0.00035 0.946
## owners 0.73651 0.017
## houses 0.68226 0.030
## council 0.61226 0.016
plot(obj,choice="sqload",coloring.var="type")
0.0 0.2 0.4 0.6 0.8
0.0
0.2
0.4
0.6
0.8
1.0
Squared loadings
Dim 1 (50.54 %)
Dim
2 (
21.3
9 %
)
density
primaryres
ownershousescouncil
numericalcategorical
The contribution cjα of a variable j to the component α is:{cjα = a2
jα = Squared correlation if variable j is numerical,
cjα =∑
s∈Ijnnsa2sα = Correlation ratio if variable j is categorical.
↪→ Called squared loadings in varimax criterion for PC rotation.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
Principal components prediction
Each principal component fα writes as a linear combination of thecolumns of X = (X1|G) where X1 is the numerical data matrix andG is the indicator matrix of the levels of the matrix X2:
fα = β0 +
p1+m∑j=1
βjxj
with:
β0 = −p1∑k=1
vkαxksk−
p1+m∑k=p1+1
vkα,
βj = vjα1
sj, for j = 1, . . . , p1
βj = vjαn
nj, for j = p1 + 1, . . . , p1 + m
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmix
A real data exampleThe PCAmix methodPrincipal component prediction
The method predict()Coefficients of the PC found on the learning set (without the 5 first cities)
test <- 1:5
obj2 <- PCAmix(X1[-test,],X2[-test,],graph=FALSE,ndim=3)
data.frame(obj2$coef)
## dim1 dim2 dim3
## const 3.64214 -9.02951 1.5950
## density 0.00086 0.00048 0.0015
## primaryres -0.00129 0.09355 -0.0179
## owners -0.05065 0.01207 -0.0047
## inf 90% 1.03579 -0.32092 -0.4617
## sup 90% -0.26076 0.08079 0.1162
## inf 5% -0.25129 -0.05706 0.2704
## sup 5% 0.96441 0.21898 -1.0379
Scores of the 5 first cities on the principal components
predict(obj2,X1[test,],X2[test,])
## dim1 dim2 dim3
## ABZAC 2.39 0.011 -1.595
## AILLAS -0.87 0.122 0.084
## AMBARES 2.65 0.795 -1.098
## AMBES 0.94 0.895 -1.164
## ANDERNOS 1.19 -2.466 0.800
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixVarimax orthogonal rotation in PCAmix
Outline
1 PCAmix
2 PCArot
3 MFAmix
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixVarimax orthogonal rotation in PCAmix
Varimax type rotation in PCAmix
Let us introduce
T an orthonormal rotation matrix: TT′ = T′T = Ikk is the number of dimensions in the rotation procedure
⇒ Rotate the loading matrix and the principal components sothat groups of variables appear: having high loadings on thesame component and negligible ones on the remainingcomponents.
⇒ In PCAmix rotated squared loadings cjα are correlations orcorrelation ratios for categorical variables.
⇒ The varimax function writes:
f (T) =k∑
α=1
p∑j=1
(cjα)2 − 1
p
k∑α=1
p∑j=1
cjα
2
. (2)
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixVarimax orthogonal rotation in PCAmix
Find the optimal rotation matrix T
The varimax rotation problem is formulated as
maxT
{f (T)|TT′ = T′T = Ik}, (3)
⇒ An iterative procedure based on successive planar rotations.
⇒ Direct solution for the optimal angle of rotation (Chavent &al. 2012).
⇒ Reduces to the Kaiser (1958) for numerical data.
⇒ Performs rotation in MCA for categorical data.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixVarimax orthogonal rotation in PCAmix
The R function
obj <- PCAmix(X.quanti=X1, X.quali=X2, rename.level=TRUE, graph=FALSE)
rot <- PCArot(obj,dim=3)
rot
##
## Call:
## PCArot(obj = obj, dim = 3)
##
## Method = rotation after Principal Component of mixed data (PCAmix)
## number of iterations: 4
##
## name description
## [1,] "$eig" "variances of the 'ndim' first dimensions after rotation"
## [2,] "$ind" "results for the individuals after rotation (coord)"
## [3,] "$quanti" "results for the quantitative variables (coord) after rotation"
## [4,] "$levels" "results for the levels of the qualitative variables (coord) after rotation"
## [5,] "$quali" "results for the qualitative variables (coord) after rotation "
## [6,] "$sqload" "squared loadings after rotation"
## [7,] "$coef" "coef of the linear combinations defining the rotated PC"
## [8,] "$theta" "angle of rotation if 'dim'=2"
## [9,] "$T" "matrix of rotation"
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixVarimax orthogonal rotation in PCAmix
The method plot()
plot(obj,choice="sqload",coloring.var="type",
leg=TRUE,axes=c(1,3), posleg="topright",
main="Squared loadings before rotation")
0.0 0.2 0.4 0.6 0.8
−0.
10.
00.
10.
20.
30.
40.
5
Squared loadings before rotation
Dim 1 (50.54 %)
Dim
3 (
12.6
1 %
)
density
primaryres
owners
houses
council
numericalcategorical
plot(rot,choice="sqload",coloring.var="type",
leg=TRUE,axes=c(1,3),posleg="topright",
main="Squared loadings after rotation")
0.0 0.2 0.4 0.6 0.8
0.0
0.2
0.4
0.6
0.8
1.0
Squared loadings after rotation
Dim 1 (38.52 %)
Dim
3 (
24.8
8 %
)
density
primaryres
owners
houses
council
numericalcategorical
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixVarimax orthogonal rotation in PCAmix
The method plot()
plot(obj,choice="ind",label=FALSE,axes=c(1,3),
main="Observations before rotation")
−2 0 2 4 6 8 10
−2
02
46
Observations before rotation
Dim 1 (50.54 %)
Dim
3 (
12.6
1 %
)
plot(rot,choice="ind",label=FALSE,axes=c(1,3),
main="Observations after rotation")
−1 0 1 2 3 4
−2
02
46
810
12
Observations after rotation
Dim 1 (38.52 %)D
im 3
(24
.88
%)
↪→ Prediction of scores of new observations on the rotatedprincipal components
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixVarimax orthogonal rotation in PCAmix
The method predict()Coefficients of the PC found on the learning set (without the 5 first cities)
obj2 <- PCAmix(X1[-test,],X2[-test,],graph=FALSE)
rot2 <- PCArot(obj2,dim=3)
data.frame(rot2$coef)
## dim1.rot dim2.rot dim3.rot
## const 2.00163 -9.3e+00 1.3741
## density -0.00094 3.9e-05 0.0022
## primaryres 0.00688 9.6e-02 -0.0014
## owners -0.03313 1.4e-02 -0.0228
## inf 90% 1.23247 -2.2e-01 -0.1816
## sup 90% -0.31027 5.5e-02 0.0457
## inf 5% -0.44031 -1.2e-01 0.1958
## sup 5% 1.68985 4.6e-01 -0.7514
Scores of the 5 first cities on the rotated principal components
predict(rot2,X1[test,],X2[test,])
## dim1.rot dim2.rot dim3.rot
## ABZAC 3.28 0.36 -0.865
## AILLAS -0.72 0.12 -0.223
## AMBARES 2.90 0.98 -0.037
## AMBES 1.73 1.15 -0.764
## ANDERNOS 0.33 -2.64 0.868
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixMultiple Factor Analysis for mixed data
Outline
1 PCAmix
2 PCArot
3 MFAmix
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixMultiple Factor Analysis for mixed data
Multiple Factor Analysis for mixed data
⇒ Analyze a set of observations described by several groups ofvariables.
⇒ Make all the groups of variables comparable in the PCAanalysis by introducing weights: the weight of a variable is theinverse of the variance of the first principal component of itsgroup.
⇒ In the function MFA in the package FactoMineR, the natureof the variables (categorical or numerical) can vary from onegroup to another, but the variables should be of the sametype within a given group.
⇒ MFAmix is able to handle mixed data within a group ofvariables.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixMultiple Factor Analysis for mixed data
The R function
dat<-cbind(gironde$employment,gironde$housing, gironde$services,gironde$environment)
#definition of the groups of variables
class.var<-c(rep(1,9),rep(2,5),rep(3,9),rep(4,4))
#names of the groups of variables
names<-c("employment","housing","services","environment")
#Perform MFAmix
obj3<-MFAmix(data=dat,groups=class.var,name.groups=names,ndim=3,rename.level=TRUE,graph=FALSE)
obj3
## **Results of the Multiple Factor Analysis for mixed data (MFAmix)**
## The analysis was performed on 542 individuals, described by 27 variables
## *Results are available in the following objects :
##
## name description
## 1 "$eig" "eigenvalues"
## 2 "$eig.separate" "eigenvalues of the separate analyses"
## 3 "$separate.analyses" "separate analyses for each group of variables"
## 4 "$groups" "results for all the groups"
## 5 "$partial.axes" "results for the partial axes"
## 6 "$ind" "results for the individuals"
## 7 "$ind.partial" "results for the partial individuals"
## 8 "$quanti" "results for the quantitative variables"
## 9 "$levels" "results for the levels of the qualitative variables"
## 10 "$quali" "results for the qualitative variables"
## 11 "$sqload" "squared loadings"
## 12 "$global.pca" "results for the global PCA"
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixMultiple Factor Analysis for mixed data
Some graphical output
−1.0 −0.5 0.0 0.5 1.0
−1.
0−
0.5
0.0
0.5
1.0
(a) Correlation circle
Dim 1 (21.78 %)
Dim
2 (
10.9
9 %
)
employmenthousingservicesenvironment
farmers
tradesmen
managers
workers unemployed
middleempl
retired
employrateincome
density
primaryresowners
buildingwater
vegetation
agricul
−1.0 −0.5 0.0 0.5 1.0
−1.
0−
0.5
0.0
0.5
1.0
(b) Partial axes
Dim 1 (21.78 %)
Dim
2 (
10.9
9 %
)
dim 1.employment
dim 2.employment
dim 3.employment
dim 1.housing
dim 2.housing
dim 3.housing
dim 1.servicesdim 2.servicesdim 3.services
dim 1.environment
dim 2.environment
dim 3.environment
employmenthousingservicesenvironment
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
(c) Groups representation
Dim 1 (21.78 %)
Dim
2 (
10.9
9 %
)
employment
housing
services
environment
0 5 10 15
−10
−5
05
(d) Partial observations
Dim 1 (21.78 %)
Dim
2 (
10.9
9 %
)
employmenthousingservicesenvironment
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixMultiple Factor Analysis for mixed data
Other packages working with mixed data
⇒ ClustOfVar for the clustering of variables.
⇒ divclust for the divisive and monothetic clustering ofobservations.
⇒ ClustGeo for the clustering with geographical constraints(very soon available for mixed data).
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixMultiple Factor Analysis for mixed data
Some references
Beaton, D., Chin Fatt, C. R., Abdi, H. (2014). An ExPosition ofmultivariate analysis with the singular value decomposition in R.Computational Statistics & Data Analysis, 72, 176-189.
Chavent, M., Kuentz, V., Liquet B., Saracco, J. (2012), ClustOfVar: An RPackage for the Clustering of Variables. Journal of Statistical Software 50,1-16.
Chavent, M., Kuentz, V., Saracco, J. (2012), Orthogonal Rotation inPCAMIX. Advances in Data Analysis and Classification 6, 131-146.
Chavent, M., Kuentz, V., Saracco, J. (2012), Multivariate analysis ofmixed type data: The PCAmixdata R package. arXiv:1411.4911v1.
Dray, S., Dufour, A., 2007. The ade4 package: implementing the dualitydiagram for ecologists. Journal of Statistical Software 22 (4), 120.
Le, S., Josse, J., Husson, F., et al. (2008). Factominer: an R package formultivariate analysis. Journal of Statistical Software 25 (1), 118.
UseR!2015 June 30 - July 3, Aalborg, Denmark
PCAmixPCArot
MFAmixMultiple Factor Analysis for mixed data
THANK YOU !
UseR!2015 June 30 - July 3, Aalborg, Denmark