The Art of Data Visualization

THE ART OF DATA VISUALISATION S Anand, Chief Data Scientist, Gramener

Transcript of The Art of Data Visualization

Page 1: The Art of Data Visualization

THE ART OF DATA VISUALISATION

S Anand, Chief Data Scientist, Gramener

Page 2: The Art of Data Visualization

THIS TALK HAS TWO PARTS

WHAT I DO IN MY CURRENT JOB

HOW I GOT MY CURRENT JOB

Page 3: The Art of Data Visualization

Heinlein, in connection with my story “Dreaming Is a Private Thing”, accused me, good-naturedly, of coining money out of my neuroses.

Well, whose neuroses should I make money off of?

Page 4: The Art of Data Visualization
Page 5: The Art of Data Visualization
Page 6: The Art of Data Visualization
Page 7: The Art of Data Visualization
Page 8: The Art of Data Visualization
Page 9: The Art of Data Visualization
Page 10: The Art of Data Visualization

LET’S TAKE TESCO’S GROCERIES

category  title                                                kJ    rate
dairy     Activia Pouring Natural Yogurt 1X950g                216   0.21
dairy     Activia Pouring Strawberry Yogurt 1X950g             250   0.21
dairy     Activia Pouring Vanilla Yogurt 1X950g                263   0.21
icecream  Almondy Daim 400G                                    1804  0.75
icecream  Almondy Toblerone 400G                               1850  0.5
cereals   Alpen 10 Pack Lite Summer Fruits Cereal Bars 210G    1222  1.57
cereals   Alpen 10Pk Fruit Nut And Chocolate Cereal Bars 290G  1812  1.14
cereals   Alpen Coconut And Chocolate Cereal Bars 5Pk 145G     1863  1.24
cereals   Alpen Fruit And Nut With Chocolate Cereal Bar 5X29g  1812  1.24
cereals   Alpen High Fruit 650G                                1439  0.4
cereals   Alpen Light Bars Chocolate And Orange 5X21g          1246  1.71
cereals   Alpen Light Chocolate And Fudge Bar 5X21g            1264  1.71
cereals   Alpen Light Sultana & Apple Bars 5Pk 105G            1197  1.71
cereals   Alpen Light Summer Fruits Bars 5Pk 105G              1222  1.71
cereals   Alpen No Added Sugar 1.3Kg                           1488  0.31
cereals   Alpen No Added Sugar 560G                            1488  0.46
cereals   Alpen Original 1.5Kg                                 1509  0.27
cereals   Alpen Original Muesli 750G                           1509  0.35
cereals   Alpen Raspberry And Yoghurt Cereal Bars 5x29g        1748  1.24
cereals   Alpen Strawberry With Yoghurt Cereal Bar 5X29g       1756  1.24
dairy     Alpro Natural Yofu 500G                                    0.28
dairy     Alpro Raspberry Vanilla Yofu 4X125g                        0.35
dairy     Alpro Strawberry And Fof Soya Yofu 4X125g                  0.35
dairy     Alpro Vanilla Yofu 500G                                    0.28

Page 11: The Art of Data Visualization
Page 12: The Art of Data Visualization
Page 13: The Art of Data Visualization
Page 14: The Art of Data Visualization

Movies on the IMDb

[Chart. Axes: MORE VOTES, BETTER RATED. Legend: many unwatched movies; few unwatched movies; mix of watched & unwatched; few watched movies; many watched movies. Labelled movies: The Shawshank Redemption; The Godfather; The Dark Knight; Titanic; The Phantom Menace; Twilight; New Moon; Wild Wild West; Transformers; The Good, The Bad, The Ugly; 12 Angry Men; 7 Samurai; Taare Zameen Par; Rang De Basanti; Yojinbo; 3 Idiots]

Page 15: The Art of Data Visualization
Page 16: The Art of Data Visualization
Page 17: The Art of Data Visualization

We handle terabyte-size data via non-traditional analytics and visualise it in real-time.

Gramener visualises your data

Gramener transforms your data into concise dashboards that make your business problem & solution visually obvious. We help you find insights quickly, based on cognitive research, and our visualisations guide you towards actionable decisions.

A data analytics and visualisation company

Page 18: The Art of Data Visualization

MOST OF WHAT I DO TODAY IS VISUALISING DATA ANOMALIES

Page 19: The Art of Data Visualization

India’s religions

Page 20: The Art of Data Visualization

Australia’s religions

Page 21: The Art of Data Visualization
Page 22: The Art of Data Visualization

As a Data Scientist, I’m quite intrigued by anomalies, and

ANOMALIES ARE EVERYWHERE…

Page 23: The Art of Data Visualization

100 YEARS OF INDIA’S WEATHER

[Chart: years on one axis, labelled 1901 to 2001 in ten-year steps; months Jan to Dec on the other]

Page 24: The Art of Data Visualization

You don’t need sophisticated analyses for this

IT CAN BE EASY TO SPOT THEM

Page 25: The Art of Data Visualization

EDUCATION

PREDICTING MARKS

What determines a child’s marks?

Do girls score better than boys?

Does the choice of subject matter?

Does the medium of instruction matter?

Does community or religion matter?

Does their birthday matter?

Does the first letter of their name matter?

Page 26: The Art of Data Visualization

TN CLASS X: ENGLISH

[Chart: y-axis 0 to 40,000 in steps of 5,000; x-axis 0 to 100 in steps of 5]

Page 27: The Art of Data Visualization

TN CLASS X: SOCIAL SCIENCE

[Chart: y-axis 0 to 40,000 in steps of 5,000; x-axis 0 to 100 in steps of 5]

Page 28: The Art of Data Visualization

TN CLASS X: MATHEMATICS

[Chart: y-axis 0 to 40,000 in steps of 5,000; x-axis 0 to 100 in steps of 5]

Page 29: The Art of Data Visualization

ICSE 2013 CLASS XII: TOTAL MARKS

Page 30: The Art of Data Visualization

DETECTING FRAUD

“We know meter readings are incorrect, for various reasons.

We don’t, however, have the concrete proof we need to start the process of meter reading automation.

Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”

ENERGY UTILITY

Page 31: The Art of Data Visualization

BILLING FRAUD AT AN ENERGY UTILITY

An energy utility (with over 50 million subscribers) had 10 years’ worth of customer billing data available. Most fraud detection software failed to load the data, and sampled data revealed little or no insight.

Below is a simple histogram (or frequency distribution) of usage levels. Each bar represents the number of customers with a specific bill amount (in units, or kWh).

This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the slab boundaries.

Tariffs are based on the usage slab. Someone with 101 units is billed in full at a higher tariff than someone with 100 units. So people have a strong incentive to stay at or within a slab boundary.

This can happen in one of two ways. First, people may be monitoring their usage very carefully, and turn off their lights and fans the instant their usage hits the slab boundary.

Or, more realistically, there’s probably some level of corruption involved, where customers pay a small sum to the meter reading staff to ensure that their reading stays exactly at the slab boundary, giving them the advantage of a lower price.
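
The histogram check described above can be sketched in a few lines of Python. This is only an illustration, not the analysis used for the utility: the readings are made up, the slab boundaries (100 and 200 units), the neighbour window and the ratio threshold are all assumptions, and `boundary_spikes` is a hypothetical helper name.

```python
from collections import Counter

def boundary_spikes(readings, boundaries, window=3, ratio=2.0):
    """Flag slab boundaries whose count is far above the nearby usage levels."""
    counts = Counter(readings)
    flagged = {}
    for b in boundaries:
        # Counts at the usage levels just below and above the boundary,
        # excluding the boundary itself.
        neighbours = [counts.get(b + d, 0)
                      for d in range(-window, window + 1) if d != 0]
        baseline = sum(neighbours) / len(neighbours)
        if baseline > 0 and counts.get(b, 0) / baseline >= ratio:
            flagged[b] = round(counts[b] / baseline, 2)
    return flagged

# Made-up readings in units (kWh): ten customers at every level from 90 to 110,
# plus forty extra readings sitting exactly on the assumed 100-unit boundary.
readings = []
for units in range(90, 111):
    readings += [units] * 10
readings += [100] * 40

spikes = boundary_spikes(readings, boundaries=[100, 200])
print(spikes)
```

A boundary is reported with the ratio of its count to the average count of its neighbours, so a smooth distribution yields nothing and heaping at a boundary stands out.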

Page 32: The Art of Data Visualization

PERFORMANCE: GIRLS VS BOYS

Subject      Girls higher by  Girls  Boys
Physics      0                119    119
Chemistry    1                123    122
English      4                130    126
Computers    6                137    131
Biology      6                129    123
Mathematics  11               123    112
Language     11               152    141
Accounting   12               138    126
Commerce     13               127    114
Economics    16               142    126

Page 33: The Art of Data Visualization

[Chart labels: Jain, Harini, Shweta, Sneha, Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya]

Page 34: The Art of Data Visualization

Based on the results of the 20 lakh students who took the Class XII exams in Tamil Nadu over the last 3 years, it appears that the month you were born in can make a difference of as much as 120 marks out of 1,200.

June-borns score the lowest

The marks shoot up for Aug-borns

… and peak for Sep-borns

120 marks out of 1200 explainable by month of birth

An identical pattern was observed in 2009 and 2010…

… and across districts, gender, subjects, and class X & XII.

“It’s simply that in Canada the eligibility cutoff for age-class hockey is January 1. A boy who turns ten on January 2, then, could be playing alongside someone who doesn’t turn ten until the end of the year—and at that age, in preadolescence, a twelve-month gap in age represents an enormous difference in physical maturity.”

-- Malcolm Gladwell, Outliers
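
The month-of-birth comparison above amounts to grouping marks by birth month and comparing the averages. A minimal sketch follows, assuming each record is a (birth month, total marks) pair. The records are invented for illustration and are not the Tamil Nadu results, and `mean_marks_by_month` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean

def mean_marks_by_month(records):
    """records: iterable of (birth_month 1-12, total_marks) -> {month: mean marks}."""
    by_month = defaultdict(list)
    for month, marks in records:
        by_month[month].append(marks)
    return {m: mean(v) for m, v in sorted(by_month.items())}

# Invented records, purely to show the shape of the calculation.
records = [(6, 700), (6, 720), (8, 760), (8, 780), (9, 790), (9, 810)]

averages = mean_marks_by_month(records)
# Gap between the best and worst month, the figure quoted per 1,200 marks above.
gap = max(averages.values()) - min(averages.values())
print(averages, gap)
```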

Page 35: The Art of Data Visualization

LET’S LOOK AT 15 YEARS OF US BIRTH DATA

This is a dataset (1975 – 1990) that has been around for several years, and has been studied extensively. Yet, a visualization can reveal patterns that are neither obvious nor well known.

For example,

• Are birthdays uniformly distributed?

• Do doctors or parents exercise the C-section option to move dates?

• Is there any day of the month that has unusually high or low births?

• Are there any months with relatively high or low births?

Very high births in September. But this is fairly well known. Most conceptions happen during the winter holiday season.

Relatively few births during the Christmas and Thanksgiving holidays, as well as New Year and Independence Day.

Most people prefer not to have children on the 13th of any month, given that it’s considered an unlucky day.

Some special days like April Fool’s Day are avoided, but Valentine’s Day is quite popular.

More births / Fewer births … on average, for each day of the year (from 1975 to 1990)
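
The values behind a day-of-year chart like this one can be computed by averaging births for each calendar day across years and dividing by the overall average. The sketch below assumes the input is a mapping from date to birth count. The counts are invented and `daily_birth_index` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date

def daily_birth_index(daily_counts):
    """daily_counts: {date: births} -> {(month, day): average births / overall average}."""
    per_day = defaultdict(list)
    for d, births in daily_counts.items():
        per_day[(d.month, d.day)].append(births)
    averages = {key: sum(v) / len(v) for key, v in per_day.items()}
    overall = sum(averages.values()) / len(averages)
    # Above 1.0 means more births than a typical day, below 1.0 means fewer.
    return {key: avg / overall for key, avg in averages.items()}

# Invented counts for three calendar days over two years.
daily_counts = {
    date(1975, 9, 12): 120, date(1976, 9, 12): 120,
    date(1975, 12, 25): 80, date(1976, 12, 25): 80,
    date(1975, 6, 1): 100, date(1976, 6, 1): 100,
}
index = daily_birth_index(daily_counts)
print(index)
```

Colouring each day by this index (rather than by raw counts) keeps the comparison independent of how many years of data are available.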

Page 36: The Art of Data Visualization

THE PATTERN IN INDIA IS QUITE DIFFERENT

This is a birth date dataset that’s obtained from school admission data for over 10 million children. When we compare this with births in the US, we see none of the same patterns.

For example,

• Is there an aversion to the 13th or is there a local cultural nuance?

• Are holidays avoided for births?

• Which months have a higher propensity for births, and why?

• Are there any patterns not found in the US data?

Very few children are born in August or later in the year. Most births are concentrated in the first half of the year.

We see a large number of children born on the 5th, 10th, 15th, 20th and 25th of each month, that is, on round-numbered dates.

Such round-numbered patterns are a typical indication of fraud. Here, birthdates are brought forward to aid early school admission.

More births / Fewer births … on average, for each day of the year (from 2007 to 2013)
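
One rough way to quantify the round-date heaping described above is to compare the share of births recorded on the 5th, 10th, 15th, 20th and 25th with the share expected if births were spread evenly over the year. This is a sketch under that even-spread assumption. The dates are synthetic and `round_day_excess` is a hypothetical helper.

```python
from collections import Counter
from datetime import date, timedelta

ROUND_DAYS = {5, 10, 15, 20, 25}  # the dates called out above

def round_day_excess(birth_dates):
    """Observed share of births on round-numbered days / share expected if uniform."""
    by_day = Counter(d.day for d in birth_dates)
    total = sum(by_day.values())
    observed = sum(by_day[day] for day in ROUND_DAYS) / total
    # A non-leap year has 5 round-numbered days in each of 12 months, out of 365.
    expected = len(ROUND_DAYS) * 12 / 365
    return observed / expected

# Synthetic data: one birth on every day of 2010 ...
uniform = [date(2010, 1, 1) + timedelta(days=i) for i in range(365)]
# ... and a heaped version with three extra births on every round-numbered date.
heaped = uniform + [date(2010, m, d) for m in range(1, 13) for d in ROUND_DAYS] * 3

print(round_day_excess(uniform), round_day_excess(heaped))
```

A value near 1 is what an even spread produces. Values well above 1 suggest recorded dates are being rounded or moved.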

Page 37: The Art of Data Visualization

THIS ADVERSELY IMPACTS CHILDREN’S MARKS

It’s a well-established fact that older children tend to do better at school in most activities. Since many children have had their birth dates brought forward, these younger children suffer.

Children “born” on the 1st, 5th, 10th, 15th etc. of the month tend to score lower marks on average.

Higher marks / Lower marks … on average, for children born on a given day of the year (from 2007 to 2013)

Children “born” on round-numbered days score lower marks on average, due to a higher proportion of younger children.

Page 38: The Art of Data Visualization

WHAT’S UNUSUAL ABOUT LOANS AFTER THE 20TH?

The personal finance division of a bank, focusing on retail loans, drove its sales through a branch sales team. A study of the non-performing assets of loans generated over the course of one year shows a strange pattern.

Every loan disbursed after the 20th of the month, i.e. from the 21st to the end of the month, shows consistently lower non-performing assets (i.e. better quality) than any loan disbursed on or before the 20th.

The bank mapped this back to their incentive scheme. The sales team’s commission is based only on loans disbursed until the 20th. Hence new loans are squeezed into this period without regard for their quality.

Analytics can detect something that you’re specifically looking for. It takes a visual to detect what you don’t know to look for.

This representation, known as a calendar map, can show some interesting patterns, particularly weekday-based patterns, as the next example will show.
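
A calendar map is essentially a grid with one row per week and one column per weekday. The layout step can be sketched with Python's standard `calendar` module. The daily values below are invented (1.0 up to the 20th, 0.5 afterwards) and `calendar_map` is a hypothetical helper, not the tool used for the slides.

```python
import calendar
from datetime import date

def calendar_map(year, month, values):
    """Lay out {date: value} as weeks (rows) by weekdays (columns, Monday first)."""
    grid = []
    for week in calendar.monthcalendar(year, month):
        # monthcalendar uses 0 for cells that fall outside the month.
        grid.append([values.get(date(year, month, day)) if day else None
                     for day in week])
    return grid

# Invented daily metric for June 2013: higher up to the 20th, lower afterwards.
values = {date(2013, 6, d): (1.0 if d <= 20 else 0.5) for d in range(1, 31)}

grid = calendar_map(2013, 6, values)
for row in grid:
    print(" ".join("   ." if v is None else f"{v:4.1f}" for v in row))
```

Because every column is a weekday, a pattern tied to a day of the week shows up as a vertical stripe, and a pattern tied to a date shows up as a break partway down the grid.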

Page 39: The Art of Data Visualization

RESTAURANT FOUND AN UNUSUAL DIP IN SALES

A restaurant chain had data for every single transaction made over a few years. Plotting this as a time series showed them nothing unusual. However, the same data on a calendar map reveals a very different story.

Specifically, at the bottom-left point-of-sale terminal, sales dip every Wednesday. At the bottom-right point-of-sale terminal, sales rise every Wednesday (almost as if to compensate for the loss).

It turns out that the manager closes the bottom-left counter every Wednesday afternoon due to shortage of staff, assuming that it results in no loss of sales. There is, however, a net loss every Wednesday.
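
The weekday effect above can also be checked numerically by averaging sales per terminal for each day of the week and then adding the terminals together. The sketch below uses invented figures and hypothetical terminal names ("left", "right"). It illustrates the idea and is not the restaurant's data.

```python
from collections import defaultdict
from datetime import date, timedelta

def weekday_means(sales):
    """sales: iterable of (date, terminal, amount) -> {terminal: [mean for Mon..Sun]}."""
    buckets = defaultdict(lambda: [[] for _ in range(7)])
    for day, terminal, amount in sales:
        buckets[terminal][day.weekday()].append(amount)
    return {t: [sum(v) / len(v) if v else 0.0 for v in days]
            for t, days in buckets.items()}

# Invented figures: four weeks starting on Monday 7 Jan 2013.
start = date(2013, 1, 7)
sales = []
for i in range(28):
    day = start + timedelta(days=i)
    is_wednesday = day.weekday() == 2
    sales.append((day, "left", 40 if is_wednesday else 100))    # dips on Wednesdays
    sales.append((day, "right", 130 if is_wednesday else 100))  # partly compensates

means = weekday_means(sales)
combined = [a + b for a, b in zip(means["left"], means["right"])]
print(combined)
```

In this made-up case the right terminal recovers only part of the left terminal's Wednesday dip, so the combined Wednesday figure is still below the other days.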

Page 40: The Art of Data Visualization

But that’s not to say that simple techniques can spot everything

YOU CAN GO BEYOND “EASY”

Page 41: The Art of Data Visualization

WHAT’S SO SPECIAL ABOUT TOBACCO?

Page 42: The Art of Data Visualization

WHAT’S WRONG WITH THE MINERAL WATER?

Page 43: The Art of Data Visualization
Page 44: The Art of Data Visualization
Page 45: The Art of Data Visualization
Page 46: The Art of Data Visualization
Page 47: The Art of Data Visualization
Page 48: The Art of Data Visualization

Try it! All you need is some data and some curiosity to…

VISUALISE DATA YOURSELF!
