"Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at...

31
Visualizing and Communicatin g High-dimensional Data Dr. Stefan Kühn Lead Data Scientist Data Natives Berlin - 26.10.2016

Transcript of "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at...

Page 1: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Visualizing and CommunicatingHigh-dimensional Data

Dr. Stefan KühnLead Data Scientist

Data Natives Berlin - 26.10.2016

Page 2: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Data Visualization Basics

2

Page 3: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

The Modes of Perception

3

X

X

X

X

X

XX

O

O

O O

O

O

O

OX

X

XX

O

O

O

OXX

X

X

X

X

XX

O

O

O O

O

O

O

O

X

X X

X

XO

OO

Fast Slow

Find the outlier

Page 4: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

The Modes of Perception

4

• Pre-attentive• fast• parallel processing• effortless

• Pattern recognition• semi-fast• governed by laws of Gestalt

• Attentive• slow• sequential• high effort (attention is a very limited resource)

Page 5: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Main Properties of Graphics

5

Category ExamplePosition

Shape

Size

Color

Orientation (Line)

Length (Line)

Type and Size (Line)

Brightness

Page 6: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Main Properties of Graphics Humans

6

Category Amount of pre-attentive information

Position very high

Shape ———

Size approx. 4

Color approx. 8

Orientation (Line) approx. 4

Length (Line) ———

Type and Size (Line) ———

Brightness approx. 8

Page 7: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Pre-attentive perception

7

• Position• fast• effective• high number of different positions

• Color• use with care

• Shape• Orientation

Pre-attentive perception is effortless.Exploit this as much as you can.

Page 8: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Pattern detection

8

„It is interesting to note that our brain […] subconsciously always prefers meaningful situations and objects.“

• Emergence• Reiification• Multi-stability• Invariance

Pattern detection can be trained.Exploit this for frequent visualizations.

Page 9: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

What is this?

9

Page 10: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

What is this?

10

Page 11: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

What is this?

11

Page 12: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Laws of Gestalt

12

„It is interesting to note that our brain, inaccordance with the laws of Gestalt,

subconsciously always prefers meaningful situations and objects.“

Page 13: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Accuracy of Graphics

13

Square Pie vs Stacked Bar vs Pie vs Donut

What do you think?

https://eagereyes.org/blog/2016/a-reanalysis-of-a-study-about-square-pie-charts-from-2009

Page 14: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Why do we visualize data?

14

• Explore• use different techniques• avoid „construction“ bias• be careful with „aesthetics“• challenge findings -> use attentive mode

• Explain• focus on message or „story“• use pre-attentive mode

Don’t trust graphics that you have not falsified yourself.

Page 15: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Beyond the Basics

15

Page 16: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Pair Plots

16

Page 17: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Correlation Plots

17

Page 18: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Fundamental Problems

• No accurate method in higher dimensions• Approximation methods• „Simulated“ dimensions (color, size, shape)

• Animations?

• No notion of quality or accuracy for Visualizations• Information Theory?• „Stability“?

All Visualizations are wrong, but some are useful.18

Page 19: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Approximation methods

• Pair Plots• Axis-aligned projections • Interpretable in terms of original variables

• Singular Value Decomposition• Optimal with respect to 2-norm (Euclidean norm)

and supremum norm• Comes with an error estimate

• Other methods• Stochastic Neighbor Embedding ((t-)SNE)• „Manifold Learning“

19

Page 20: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Manifold Learning Methods

• Locally Linear Embedding• Neighborhood-preserving embedding

• Isomap• quasi-isometric

• Multi-dimensional scaling• quasi-isometric

• Spectral Embedding• Spectral clustering based on similarity

• Stochastic Neighbor Embedding (SNE, t-SNE)• preserves conditional probabilities for similarity

• Local Tangent Space Alignement (LTSA)

20

Page 21: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Manifold Learning

21

Page 22: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Manifold Learning

22

Page 23: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Manifold Learning

23

Page 24: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Local Tangent Space Alignement

24

Page 25: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Principal Components and Curves

• Principal Component Analysis• orthogonal decomposition based on SVD• linear in all variables• tries to preserve variance

• Principal Curves• minimize the Sum of Squared Errors with respect

to all variables (as PCA, preserve variance)• nonlinear• smooth

25

Page 26: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Principal Components and Curves

26

Page 27: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

Parallel Coordinates

• Parallel Coordinates• especially useful for high-dimensional data• depends on ordering and scaling

27

Page 28: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

The Grand Tour

• Animated sequence of 2-D projections• https://en.wikipedia.org/wiki/

Grand_Tour_(data_visualisation)• Asimov (1985): The grand tour: a tool for viewing

multidimensional data.

• Underlying idea• Randomly generate 2-D projections (random

walk)• Over time generate a dense subset of all

possible 2-D projections• Optional: Follow a given path / guided tour

28

Page 29: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

The Grand Tour

29

Page 30: "Visualising and Communicating High Dimensional Data", Stefan Kuehn, Lead Data Scientist at codecentric AG

The Grand Tour

30