An R package for exploratory data analysis for teaching ...

35
An R package for exploratory data analysis for teaching and research François Husson , Julie Josse & Sébastien Lê

Transcript of An R package for exploratory data analysis for teaching ...

Page 1: An R package for exploratory data analysis for teaching ...

An R package for exploratory data

analysis for teaching and research

François Husson, Julie Josse & Sébastien Lê

Page 2: An R package for exploratory data analysis for teaching ...

To make exploratory multivariate data analysis with a free

software

The possibility to propose new methods (taking into

account different structure on the data)

To have a package user friendly and oriented to

practitioner (a very easy GUI)

Why ?

Page 3: An R package for exploratory data analysis for teaching ...

Methods implemented are similar in their main objective: to

sum up and simplify the data by reducing the dimensionality

of the dataset

Continuous variables: Principal Components Analysis

Contingency table: Correspondence Analysis

Categorical variables: Multiple Correspondence Analysis

Continuous and categorical variables: Mixed Data Analysis

1 – The classical methods

Page 4: An R package for exploratory data analysis for teaching ...

Data : performances of 41 athletes during two meetings of decathlon

100

m

Lo

ng

.jum

p

Sh

ot.

pu

t

Hig

h.ju

mp

40

0m

110

m.h

urd

le

Dis

cu

s

Po

le.v

ault

Jav

elin

e

150

0m

Ra

nk

Po

ints

Co

mp

eti

tio

n

SEBRLE 11.04 7.58 14.83 2.07 49.81 14.69 43.75 5.02 63.19 291.70 1 8217 Decastar CLAY 10.76 7.40 14.26 1.86 49.37 14.05 50.72 4.92 60.15 301.50 2 8122 Decastar KARPOV 11.02 7.30 14.77 2.04 48.37 14.09 48.95 4.92 50.31 300.20 3 8099 Decastar BERNARD 11.02 7.23 14.25 1.92 48.93 14.99 40.87 5.32 62.77 280.10 4 8067 Decastar YURKOV 11.34 7.09 15.19 2.10 50.42 15.31 46.26 4.72 63.44 276.40 5 8036 Decastar Sebrle 10.85 7.84 16.36 2.12 48.36 14.05 48.72 5.00 70.52 280.01 1 8893 OlympicG Clay 10.44 7.96 15.23 2.06 49.19 14.13 50.11 4.90 69.71 282.00 2 8820 OlympicG Karpov 10.50 7.81 15.93 2.09 46.81 13.97 51.65 4.60 55.54 278.11 3 8725 OlympicG Macey 10.89 7.47 15.73 2.15 48.97 14.56 48.34 4.40 58.46 265.42 4 8414 OlympicG Warners 10.62 7.74 14.48 1.97 47.97 14.01 43.73 4.90 55.39 278.05 5 8343 OlympicG

PCA Example

Page 5: An R package for exploratory data analysis for teaching ...

PCA example

Introduction of

supplementary information:• supplementary continuous

variables

Graphs enriched by :• representing the variables

according to their quality of

representation

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

Variables factor map (PCA)

Dimension 1 (32.72%)

Dim

ens

ion

2 (

17.3

7%

)

100m

Long.jump

Shot.put

High.jump

400m

110m.hurdle

Discus

Javeline

1500m

RankPoints

Indicators:• contribution

• quality of representation

Page 6: An R package for exploratory data analysis for teaching ...

Introduction of supplementary information:• supplementary individuals

• supplementary categorical variables

Graphs enriched by:• coloring according to

supplementary

information

• confidence ellipses

around the categories

-4 -2 0 2 4 6

-4-2

02

4

Individuals factor map (PCA)

Dimension 1 (32.72%)

Dim

ens

ion

2 (

17.

37%

)

SEBRLECLAYKARPOV

BERNARD

YURKOV

WARNERS

ZSIVOCZKY

McMULLENMARTINEAU

HERNU

BARRAS

NOOL

BOURGUIGNON

Sebrle

Clay

Karpov

Macey

Warners

Zsivoczky

Hernu

Nool

Bernard

Schwarzl

Pogorelov

Schoenbeck

Barras

Smith

Averyanov

Ojaniemi

Smirnov

Qi

Drews

Parkhomenko

Terek

Gomez

Turi

Lorenzo

Karlivans

Korkizoglou

Uldal

Casarsa

DecastarOlympicG

Decastar

OlympicG

PCA example

Indicators:• contribution

• quality of representation

Page 7: An R package for exploratory data analysis for teaching ...

PCA example

-4 -2 0 2 4

Dimension 1 (32.71 %)

-2

0

2

4

Dim

ensi

on 2

(17

.37

%)

DecastarOlympicG

Schoenbeck

SEBRLECLAY

KARPOV

BERNARD

YURKOV

WARNERS

ZSIVOCZKY

McMULLEN

MARTINEAUHERNU

BARRAS

NOOL

BOURGUIGNON

Sebrle

Clay

Karpov

Macey

Warners

Zsivoczky

Hernu

Nool

Bernard

Schwarzl

Pogorelov

Barras

Smith

Averyanov

Ojaniemi

Smirnov

Qi

Drews

Parkhomenko

Terek

Gomez

Turi

Lorenzo

Karlivans

Korkizoglou

Uldal

Casarsa

Nb points

Page 8: An R package for exploratory data analysis for teaching ...

PCA example

-4 -2 0 2 4

Dimension 1 (32.71 %)

-2

0

2

4

Dim

ensi

on 2

(17

.37

%)

DecastarOlympicG

SEBRLECLAY

KARPOV

BERNARD

YURKOV

WARNERS

ZSIVOCZKY

McMULLEN

MARTINEAUHERNU

BARRAS

NOOL

BOURGUIGNON

Sebrle

Clay

Karpov

Macey

Warners

Zsivoczky

Hernu

Nool

Bernard

Schwarzl

Pogorelov

Schoenbeck

Barras

Smith

Averyanov

Ojaniemi

Smirnov

Qi

Drews

Parkhomenko

Terek

Gomez

Turi

Lorenzo

Karlivans

Korkizoglou

Uldal

Casarsa

Pole.vault

Page 9: An R package for exploratory data analysis for teaching ...

Description of the dimensions

By the quantitative variables:

• The correlation between each variable and the coordinate of the individuals on the axis s is calculated

• The correlation coefficients are sorted • Only the significant correlations are given

$Dim.1$Dim.1$quanti Dim.1Points 0.96Long.jump 0.74Shot.put 0.62Rank -0.67400m -0.68110m.hurdle -0.75100m -0.77

$Dim.2$Dim.2$quanti Dim.2Discus 0.61Shot.put 0.60

Significant level = 0.05

Best variable to describe the 1st dimension

Page 10: An R package for exploratory data analysis for teaching ...

Description of the dimensions

By the qualitative variables:

• Perform a one-way analysis of variance with the coordinates of

the individuals on the axis explained by the qualitative variable

Significant level = 0.2

$Dim.1$quali P-valueCompetition 0.155

$Dim.1$category Estimate P-valueOlympicG 0.4393 0.155Decastar -0.4393 0.155

• For each category, a student

T-test to compare the average

of the category with the general

mean

• A F-test by variable

Page 11: An R package for exploratory data analysis for teaching ...

2 – Structure on the data

Different structure on the data are proposed:

a partition on the variables: several sets of variables are

simultaneously studied: Multiple Factor Analysis,

Generalized Procrustes Analysis

a hierarchy on the variables: variables are grouped and

subgrouped (like in questionnaires structured in topics

and subtopics): Hierarchical Multiple Factor Analysis

a partition on the individuals: several sets of individuals

described by the same variables: Dual Multiple Factor

Analysis

Page 12: An R package for exploratory data analysis for teaching ...

Groups of variables (MFA)

Groups of

variables are

quantitative and/

or qualitative

Objectives : - study the link between the sets of variables - balance the influence of each group of variables - give the classical graphs but also specific graphs:

groups of variables - partial representation

Examples : - Genomic: DNA, protein- Sensory analysis: sensorial, physico-chemical- Comparison of coding (quantitative / qualitative)

Page 13: An R package for exploratory data analysis for teaching ...

Hierarchy on the variables (HMFA)

Two levels for the hierarchy: the first one contains L groups,each l group contains Jl subgroups, and each subgroup have Kj variables

Objective: to balance the groups and the subgroups of variables

Page 14: An R package for exploratory data analysis for teaching ...

Partition on the individuals (DMFA)

Objective: to compare the covariance matrices

Group 1

Group J

1 k K 1 i xik I1 1

IJ

Page 15: An R package for exploratory data analysis for teaching ...

3 – Graphical User Interface

Menu of the FactoMineR GUI

Page 16: An R package for exploratory data analysis for teaching ...

3 – Graphical User Interface

Main window of the PCA

Page 17: An R package for exploratory data analysis for teaching ...

3 – Graphical User Interface

Graphical options

Page 18: An R package for exploratory data analysis for teaching ...

3 – Graphical User Interface

Page 19: An R package for exploratory data analysis for teaching ...

4 – Conclusion

For researchers, practitioners and students: with classical and

advanced methods

The FactoMineR package is available on the CRAN

The GUI can be simply loaded:source("http://factominer.free.fr/install-facto.r")

A website is dedicated to this package: http://factominer.free.fr

Future: dynamical graphsPerspective: UseR!2008 (2 tutorials), UseR!2009 at Rennes

Page 20: An R package for exploratory data analysis for teaching ...

MFA example: representation of the individuals

-4 -2 0 2

-20

24

Dim 1 (49.38 %)

Dim

2 (

19.4

9 %

)

2EL

1CHA

1FON

1VAU

1DAM2BOU 1BOI

3EL

DOM11TUR

4EL PER1

2DAM

1POY1ING

1BEN2BEA

1ROC2ING

T1 T2

http://factominer.free.fr

Page 21: An R package for exploratory data analysis for teaching ...

MFA example: representation of the variables

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

Correlation circle

Dim 1 (49.38 %)

Dim

2 (

19.4

9 %

)odorvisualodor.after.shak.tasteoverall Odor.Intensity.before.shaking

Aroma.quality.before.shakingFruity.before.shaking

Flower.before.shaking

Spice.before.shaking

Visual.intensityNuance

Surface.feeling

Odor.Intensity

Quality.of.odourFruity

Flower

SpicePlante

Phenolic

Aroma.intensity

Aroma.persistency

Aroma.quality

Attack.intensity

Acidity Astringency

Alcohol

BalanceSmooth

Bitterness

Intensity

Harmony

Overall.qualityTypical

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

Correlation circle

Dim 1 (49.38 %)

Dim

2 (

19.4

9 %

)odorvisualodor.after.shak.tasteoverall Odor.Intensity.before.shaking

Aroma.quality.before.shakingFruity.before.shaking

Flower.before.shaking

Spice.before.shaking

Visual.intensityNuance

Surface.feeling

Odor.Intensity

Quality.of.odourFruity

Flower

SpicePlante

Phenolic

Aroma.intensity

Aroma.persistency

Aroma.quality

Attack.intensity

Acidity Astringency

Alcohol

BalanceSmooth

Bitterness

Intensity

Harmony

Overall.qualityTypical

http://factominer.free.fr

Page 22: An R package for exploratory data analysis for teaching ...

MFA example: representation of the groups

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Dim 1 (49.38 %)

Dim

2 (

19.

49 %

)

odor

visual

odor.after.shaking

taste

origin

overall

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

Dim 1 (49.38 %)

Dim

2 (

19.

49 %

)

odor

visual

odor.after.shaking

taste

origin

overall

http://factominer.free.fr

Page 23: An R package for exploratory data analysis for teaching ...

-6 -4 -2 0 2 4

-4-2

02

46

Individual factor map

Dim 1 (49.38 %)

Dim

2 (

19

.49

%)

2EL

1CHA

1FON

1VAU

1DAM2BOU1BOI

3EL

DOM11TUR

4EL PER1

2DAM1POY1ING

1BEN2BEA

1ROC2ING

T1 T2

odorvisualodor.after.shakingtaste

MFA example: representation of the partial points

http://factominer.free.fr

Page 24: An R package for exploratory data analysis for teaching ...

MFA example: representation of the partial points

-4 -2 0 2 4

02

46

Dim 1 (49.38 %)

Dim

2 (

19

.49

%)

Saumur

Bourgueuil

ChinonReferenceEnv1

Env2

Env4

olfvisolfaggust

http://factominer.free.fr

Page 25: An R package for exploratory data analysis for teaching ...

Unsupervised classification

1V

AU

2IN

G

3E

L

T1

T2

1D

AM

2B

EA

4E

L

PE

R1

2B

OU

1B

OI

1IN

G

2D

AM

1P

OY

1C

HA

1F

ON

1T

UR

2E

L

DO

M1

1B

EN

1R

OC

02

46

81

01

2Cluster Dendrogram for Solution HClust.2

Method=ward; Distance=euclidianObservation Number in Data Set don

He

igh

t 4 classes

http://factominer.free.fr

Page 26: An R package for exploratory data analysis for teaching ...

MFA example: representation of the individuals

-4 -2 0 2

-20

24

Dim 1 (49.38 %)

Dim

2 (

19.4

9 %

)

2EL

1CHA

1FON

1VAU

1DAM2BOU1BOI

3EL

DOM11TUR

4EL PER1

2DAM

1POY1ING

1BEN2BEA

1ROC2ING

T1

T2

classe1

classe2

classe3

classe4

classe1classe2classe3classe4

http://factominer.free.fr

Page 27: An R package for exploratory data analysis for teaching ...

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

Prefmap-PLS graph between Rank and Points

Correlation between Rank and Points : -0.7392Rank

Po

ints

X100m

Long.jump

Shot.putHigh.jump

X400mX110m.hurdle

Discus

Pole.vault

Javeline

X1500m

http://factominer.free.fr

Page 28: An R package for exploratory data analysis for teaching ...

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

Dimension 1 (61.49%)

Dim

ensi

on 2

(16

.46%

)

Sepal.Length

Sepal.Width

Petal.Length

Petal.Width

Sepal.Length

Sepal.Width

Petal.Length

Petal.Width

Sepal.Length

Sepal.Width

Petal.Length

Petal.Width

Sepal.Length

Sepal.Width

Petal.Length

Petal.Width

setosaversicolorvirginicavar

http://factominer.free.fr

MFA example: representation of the variables

Page 29: An R package for exploratory data analysis for teaching ...

0.0 0.2 0.4 0.6 0.8 1.0 1.2

0.0

0.2

0.4

0.6

0.8

1.0

Projection of the groups

Dim 1

Dim

2

setosa

versicolor

virginica

http://factominer.free.fr

Page 30: An R package for exploratory data analysis for teaching ...

-1.0 -0.5 0.0 0.5 1.0

-1.0

-0.5

0.0

0.5

1.0

Biplot between axes 1 and 2 for group versicolor

Correlation between Dim.1 and Dim.2 : 0.09613Dim.1

Dim

.2

Sepal.Length

Sepal.Width

Petal.Length

Petal.Width

http://factominer.free.fr

Page 31: An R package for exploratory data analysis for teaching ...

-3 -2 -1 0 1 2 3

-2-1

01

23

Individual factor map

Dim 1 (20.39 %)

Dim

2 (

13

.29

%)

A

GBM

O

OA

AGBMOOA

http://factominer.free.fr

Page 32: An R package for exploratory data analysis for teaching ...

-1 0 1 2

-10

12

Individual factor map

Dim 1 (20.39 %)

Dim

2 (

13

.29

%)

A

GBM

O

OA

CGHexpr

http://factominer.free.fr

Page 33: An R package for exploratory data analysis for teaching ...

-4 -2 0 2 4

-4-2

02

Individual factor map

Dim 1 (20.39 %)

Dim

2 (

13

.29

%)

A

GBM

O

OA

http://factominer.free.fr

Page 34: An R package for exploratory data analysis for teaching ...

*Dataframe

11 12

X100m

7

8

Long.jump

DecastarOlympicG

BERNARD

SEBRLE

CLAY

KARPOV

YURKOV

WARNERS

ZSIVOCZKYMcMULLEN

MARTINEAU

HERNU

BARRAS

NOOL

BOURGUIGNON

Sebrle

Clay

Karpov

Macey

Warners

Zsivoczky

Hernu

Nool

Bernard

Schwarzl

PogorelovSchoenbeck

Barras

Smith

Averyanov

Ojaniemi

Smirnov

Qi

Drews

Parkhomenko

Terek

Gomez

Turi

Lorenzo

Karlivans

Korkizoglou

Uldal

Casarsa

http://factominer.free.fr

Page 35: An R package for exploratory data analysis for teaching ...

The FactoMineR team is nearly all the time

ready to improve the package