Automating Data Exploration SciPy 2016

Post on 16-Apr-2017



1

AUTOMATING DATA EXPLORATION: A structured approach to analysing data

A TOOL AGNOSTIC APPROACH

2

AUTOMATING DATA EXPLORATION: A structured approach to analysing data

METADATA

UNIVARIATE ANALYSIS

BIVARIATE ANALYSIS

3

LET’S TAKE A DATASET

Each row has details about an employee who has left the organization.

Just “reading” the dataset is quite informative.

4

DESCRIBE THE DATA IN A STRUCTURED WAY
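As a rough illustration of what a structured description could contain, the sketch below summarises each column of a small table. The `describe_columns` helper and the employee records are invented for this example and are not taken from the talk.

```python
from collections import Counter

def describe_columns(rows):
    """Summarise each column: row count, missing values, distinct values, inferred kind."""
    columns = list(rows[0].keys()) if rows else []
    summary = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        present = [v for v in values if v not in (None, "")]
        numeric = bool(present) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
        )
        summary[col] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "kind": "numeric" if numeric else "categorical",
            "top": Counter(present).most_common(1)[0][0] if present else None,
        }
    return summary

# Hypothetical employee records (invented for illustration).
employees = [
    {"Region": "India", "Band": "5A", "Age": 27},
    {"Region": "India", "Band": "4B", "Age": 34},
    {"Region": "China", "Band": "5A", "Age": None},
]
for column, info in describe_columns(employees).items():
    print(column, info)
```

The "kind" field is what decides which of the later analyses apply: categorical columns get counts and ranks, numeric columns get distributions.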

5

AUTOMATING DATA EXPLORATION: A structured approach to analysing data

METADATA

UNIVARIATE ANALYSIS

BIVARIATE ANALYSIS

6

CATEGORICAL COLUMNS YIELD VERY LITTLE DATA

There’s not much information in one column.

The values are not quantitative, so a distribution is not meaningful.

The values are not even ordered.

In fact, the only thing we have is the list of values and their count.

... or is there more to this?

Region          Count
India           10780
Headstrong      1554
China           1130
Philippines     1030
US              792
Romania         788
Mexico          324
Guatemala       233
Poland          124
Brazil          45
Hungary         41
Colombia        38
Netherlands     33
South Africa    30
UK              18
UAE             15
GMS India       15
Japan           11
CZECH Republic  10
Kenya           9

7

... BUT RANK FREQUENCY IS STILL POSSIBLE

The rank of the row provides additional information.

With this, we can explore the distribution of the rank against the count.

These distributions are called rank-frequency distributions.

Rank  Region          Count
1     India           10780
2     Headstrong      1554
3     China           1130
4     Philippines     1030
5     US              792
6     Romania         788
7     Mexico          324
8     Guatemala       233
9     Poland          124
10    Brazil          45
11    Hungary         41
12    Colombia        38
13    Netherlands     33
14    South Africa    30
15    UK              18
16    UAE             15
17    GMS India       15
18    Japan           11
19    CZECH Republic  10
20    Kenya           9
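A rank-frequency table like this one can be built from any categorical column by counting values and numbering them from most to least frequent. The following is a minimal sketch; the `rank_frequency` name and the tiny sample column are made up for illustration.

```python
from collections import Counter

def rank_frequency(values):
    """Return (rank, value, count) tuples, most frequent value first."""
    counts = Counter(values).most_common()
    return [(rank, value, count)
            for rank, (value, count) in enumerate(counts, start=1)]

# A tiny made-up column; real data would have one entry per employee.
regions = ["India"] * 5 + ["China"] * 3 + ["US"] * 2 + ["Kenya"]
table = rank_frequency(regions)
for rank, region, count in table:
    print(rank, region, count)
```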

8

REGION SHOWS A POWER LAW DISTRIBUTION


[Plot: Region count against rank. x-axis: rank on a log scale; y-axis: frequency on a log scale]
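One rough way to check for a power law is to see whether the counts fall on a straight line when both rank and frequency are on log scales. The sketch below fits a least-squares line to log(count) against log(rank) using the Region counts from the slide. The `loglog_slope` helper is an illustrative name, and an ordinary least-squares fit is only a quick diagnostic, not a rigorous power-law test.

```python
import math

def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Region counts from the slide, already sorted by rank.
region_counts = [10780, 1554, 1130, 1030, 792, 788, 324, 233, 124, 45,
                 41, 38, 33, 30, 18, 15, 15, 11, 10, 9]
slope = loglog_slope(region_counts)
print(f"slope on log-log axes: {slope:.2f}")
```

A negative slope with points close to the fitted line is consistent with a power law; strong curvature on the log-log plot suggests some other distribution.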

9

COST CODE SHOWS A POWER LAW DISTRIBUTION

Cost Code  Count
105        9542
121        1757
125        875
122        796
3001       654
3310       635
124        435
131        415
115        336
nan        207
101        205
127        173
109        148
116        91
126        66
...

10

LE SHOWS A POWER LAW DISTRIBUTION

LE   Count
D84  11487
GPL  853
RM1  789
LC2  565
GMR  323
D95  247
GUT  233
ML1  223
CTK  184
AXE  127
A38  98
A21  79
EMP  61
BRL  45
A66  43
...

11

WHAT CAUSES POWER LAW DISTRIBUTIONS?

PREFERENTIAL ATTACHMENT

EXPONENTIAL GROWTH
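To see how preferential attachment can produce such skewed counts, here is a small simulation sketch. It assumes a simple "rich get richer" rule (each new unit joins an existing item with probability proportional to that item's size, or occasionally starts a new item). The function name, the 10% new-item probability and the seed are arbitrary choices for illustration, not parameters from the talk.

```python
import random
from collections import Counter

def simulate_preferential_attachment(steps, new_item_prob=0.1, seed=0):
    """Rich-get-richer process: each new unit joins an existing item with
    probability proportional to that item's current size, or starts a new item."""
    rng = random.Random(seed)
    units = [0]      # one entry per unit, holding the id of the item it belongs to
    next_id = 1
    for _ in range(steps):
        if rng.random() < new_item_prob:
            units.append(next_id)            # start a brand-new item
            next_id += 1
        else:
            units.append(rng.choice(units))  # picks items in proportion to size
    return Counter(units)

sizes = simulate_preferential_attachment(steps=10_000)
ranked = sorted(sizes.values(), reverse=True)
print("largest items:", ranked[:5])
print("median item size:", ranked[len(ranked) // 2])
```

Comparing the largest items with the median item size gives a quick feel for how uneven the resulting counts are.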

12

NO. OF FOLLOWERS ON GITHUB

Username      Count
slidenerd     1700
astaxie       1320
MugunthKumar  1081
honcheng      870
arunoda       827
csjaba        670
cheeaun       658
timoxley      600
karlseguin    600
hemanth       514
arvindr21     400
yuvipanda     335
mbrochh       330
anandology    330
sayanee       314
zz85          314
sanand0       309
captn3m0      300
sameersbn     300
...

13

NO. OF MOVIES ACTED IN BY BOLLYWOOD PEOPLE

Person           Count
Lata Mangeshkar  824
Asha Bhosle      810
Shakti Kapoor    589
Kishore Kumar    585
Mohammed Rafi    527
Sunidhi Chauhan  515
Alka Yagnik      451
Udit Narayan     435
Kader Khan       430
Sonu Nigam       405
Sameer           398
Asrani           397
Helen            395
Shaan            377
Aruna Irani      375
Anupam Kher      367
Shreya Ghoshal   357
Gulshan Grover   341
...

14

PARTIES IN PARLIAMENT ELECTIONS

Name    Count
IND     44704
INC     7213
BJP     3354
BSP     2628
SP      1311
CPI     1102
JD      943
CPM     914
DDP     716
JNP     676
BJS     657
JP      563
NOTA    543
PSP     538
INC(I)  492
SHS     467
AAP     432
SWA     410
...

15

CANDIDATE NAMES IN ASSEMBLY ELECTIONS

Name               Count
NONE OF THE ABOVE  629
OM PRAKASH         478
ASHOK KUMAR        411
RAM SINGH          362
RAJ KUMAR          294
ANIL KUMAR         271
AMAR SINGH         248
MOHAN LAL          235
RAM KUMAR          224
BABU LAL           218
RAM PRASAD         213
JAGDISH            210
VIJAY KUMAR        207
RAJENDRA SINGH     196
VINOD KUMAR        195
SHYAM LAL          193
RAJESH KUMAR       186
SITA RAM           186
RAM LAL            171
...

16

STUDENT NAMES IN SSA SURVEY

Name            Count
M.MANIKANDAN    99
S.PAVITHRA      84
S.MANIKANDAN    84
R.RAMYA         82
S.SANGEETHA     70
R.MANIKANDAN    69
S.DIVYA         68
M.PAVITHRA      68
S.SANTHIYA      67
S.VIGNESH       67
M.PRIYA         67
M.MAHALAKSHMI   64
S.SARANYA       63
S.SURYA         60
K.MANIKANDAN    60
P.PAVITHRA      56
S.GAYATHRI      56
P.MANIKANDAN    55
...

[Figure: names shown on the slide: Jain, Harini, Shweta, Sneha, Pooja, Ashwin, Shah, Deepti, Sanjana, Varshini, Ezhumalai, Venkatesan, Silambarasan, Pandiyan, Kumaresan, Manikandan, Thirupathi, Agarwal, Kumar, Priya]

18

NOT EVERYTHING IS POWER-LAW, THOUGH

We need to understand, from their behaviour, what drives these distributions.

19

ORDERED CATEGORICALS HAVE MORE INFORMATION

20

CORPORATE BAND

LE          Count
5           12247
4           4449
3           205
2           63
Not Mapped  24
1           22
SVP         10

21

LOCAL BAND

LE          Count
5A          7483
5B          4764
4A          1683
4B          1612
4C          747
4D          407
3           205
2           63
Not Mapped  24
1           22
SVP         10

22

QUANTITIES HAVE EVEN MORE INFORMATION

23

AGE DISTRIBUTION IS LOG-NORMAL
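A quick, informal check for log-normality is to compare the skewness of the raw values with the skewness of their logarithms: for log-normal data the logs should be roughly symmetric. The sketch below uses synthetic ages, not the talk's data, and the `skewness` helper, the median of 30 and the spread of 0.25 are all assumptions made for illustration.

```python
import math
import random

def skewness(values):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Synthetic ages drawn from a log-normal distribution (not the talk's data).
rng = random.Random(42)
ages = [rng.lognormvariate(math.log(30), 0.25) for _ in range(10_000)]

raw_skew = skewness(ages)
log_skew = skewness([math.log(a) for a in ages])
print(f"skewness of ages: {raw_skew:.2f}, skewness of log(ages): {log_skew:.2f}")
```

On real data, a clearly positive raw skewness together with a near-zero skewness of the logs is a hint, not proof, that a log-normal model fits.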

24

DETECTING FRAUD

“We know meter readings are incorrect, for various reasons.

We don’t, however, have the concrete proof we need to start the process of meter reading automation.

Part of our problem is the volume of data that needs to be analysed. The other is the inexperience in tools or analyses to identify such patterns.”

ENERGY UTILITY

25

This plot shows the frequency of all meter readings from Apr-2010 to Mar-2011. An unusually large number of readings are aligned with the tariff slab boundaries.

This clearly shows collusion of some form with the customers.
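The alignment with slab boundaries can be quantified by measuring what share of readings fall exactly on a boundary. The sketch below does this for the twelve monthly readings of the first customer listed on this slide. The boundary values are an assumption (the talk does not list the actual tariff slabs), and `share_on_boundaries` is an illustrative name.

```python
def share_on_boundaries(readings, boundaries):
    """Fraction of readings that land exactly on a tariff slab boundary."""
    boundary_set = set(boundaries)
    hits = sum(1 for reading in readings if reading in boundary_set)
    return hits / len(readings)

# Twelve monthly readings for one customer, taken from the slide's table.
customer = [217, 219, 200, 200, 200, 200, 200, 200, 200, 350, 200, 200]
# Assumed slab boundaries; the talk does not list the actual tariff slabs.
assumed_boundaries = [50, 100, 150, 200, 250]

share = share_on_boundaries(customer, assumed_boundaries)
print(f"{share:.0%} of readings sit on an assumed boundary")
```

Running the same function per customer and sorting by the result is one way to surface the specific customers whose readings sit on boundaries unusually often.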

Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
217     219     200     200     200     200     200     200     200     350     200     200
250     200     200     200     201     200     200     200     250     200     200     150
250     150     150     200     200     200     200     200     200     200     200     150
150     200     200     200     200     200     200     200     200     200     200     50
200     200     200     150     180     150     50      100     50      70      100     100
100     100     100     100     100     100     100     100     100     100     110     100
100     150     123     123     50      100     50      100     100     100     100     100
0       111     100     100     100     100     100     100     100     100     50      50
0       100     27      100     50      100     100     100     100     100     70      100
1       1       1       100     99      50      100     100     100     100     100     100

This happens with specific customers, not randomly. Here are such customers’ meter readings.

Section    Apr-10  May-10  Jun-10  Jul-10  Aug-10  Sep-10  Oct-10  Nov-10  Dec-10  Jan-11  Feb-11  Mar-11
Section 1  70%     97%     136%    65%     110%    116%    121%    107%    114%    88%     74%     109%
Section 2  66%     92%     66%     87%     70%     64%     63%     50%     58%     38%     41%     54%
Section 3  90%     46%     47%     43%     28%     31%     50%     32%     19%     38%     8%      34%
Section 4  44%     24%     36%     39%     21%     18%     24%     49%     56%     44%     31%     14%
Section 5  4%      63%     -27%    20%     41%     82%     26%     34%     43%     2%      37%     15%
Section 6  18%     23%     30%     21%     28%     33%     39%     41%     39%     18%     0%      33%
Section 7  36%     51%     33%     33%     27%     35%     10%     39%     12%     5%      15%     14%
Section 8  22%     21%     28%     12%     24%     27%     10%     31%     13%     11%     22%     17%
Section 9  19%     35%     14%     9%      16%     32%     37%     12%     9%      5%      -3%     11%

If we define the “extent of fraud” as the percentage excess of the 100-unit meter reading, the value varies considerably across sections and over time.

New section manager arrives

… and is transferred out

… with some explainable anomalies.

Why would these happen?
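The talk does not spell out how the "percentage excess" is computed. One plausible reading, sketched here, compares the number of readings at exactly 100 units with the average count of the neighbouring readings. The `excess_at` helper, the two-unit window and the frequency table are all assumptions made for illustration.

```python
def excess_at(counts, target=100, window=2):
    """Percentage by which the count at `target` exceeds the average count
    of its neighbours within `window` units on either side."""
    neighbours = [counts.get(target + offset, 0)
                  for offset in range(-window, window + 1) if offset != 0]
    expected = sum(neighbours) / len(neighbours)
    return (counts.get(target, 0) - expected) / expected * 100

# Made-up frequency table: reading (units) -> number of meters with that reading.
reading_counts = {98: 40, 99: 42, 100: 90, 101: 44, 102: 38}
excess = excess_at(reading_counts)
print(f"excess at 100 units: {excess:.0f}%")
```

Under this definition a negative value, like some in the table above, would mean fewer readings at 100 units than the neighbouring readings would suggest. The function assumes the neighbouring counts are not all zero.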

26

PREDICTING MARKS

“What determines a child’s marks?

Do girls score better than boys?

Does the choice of subject matter?

Does the medium of instruction matter?

Does community or religion matter?

Does their birthday matter?

Does the first letter of their name matter?”

EDUCATION

27

TN CLASS X: ENGLISH

[Chart: distribution of marks. x-axis: marks from 0 to 99; y-axis: count from 0 to 40,000]

28

TN CLASS X: SOCIAL SCIENCE

[Chart: distribution of marks. x-axis: marks from 0 to 99; y-axis: count from 0 to 40,000]

29

TN CLASS X: LANGUAGE

[Chart: distribution of marks. x-axis: marks from 0 to 99; y-axis: count from 0 to 40,000]

30

TN CLASS X: SCIENCE

[Chart: distribution of marks. x-axis: marks from 0 to 99; y-axis: count from 0 to 40,000]

31

TN CLASS X: MATHEMATICS

[Chart: distribution of marks. x-axis: marks from 0 to 99; y-axis: count from 0 to 40,000]

32

ICSE 2013 CLASS XII: TOTAL MARKS

33

CBSE 2013 CLASS XII: ENGLISH MARKS

34

CBSE 2013 CLASS XII: PHYSICS MARKS

35

AUTOMATING DATA EXPLORATION: A structured approach to analysing data

METADATA

UNIVARIATE ANALYSIS

BIVARIATE ANALYSIS

36

LET’S TAKE ONE DAY CRICKET DATA

Country       Player                 Runs  ScoreRate  MatchDate   Ground                                Versus
Australia     Michael J Clarke       99*   93.39      30-06-2010  The Oval                              England
Australia     Dean M Jones           99*   128.57     28-01-1985  Adelaide Oval                         Sri Lanka
Australia     Bradley J Hodge        99*   115.11     04-02-2007  Melbourne Cricket Ground              New Zealand
India         Virender Sehwag        99*   99         16-08-2010  Rangiri Dambulla International Stad.  Sri Lanka
New Zealand   Bruce A Edgar          99*   72.79      14-02-1981  Eden Park                             India
Pakistan      Mohammad Yousuf        99*   95.19      15-11-2007  Captain Roop Singh Stadium            India
West Indies   Richard B Richardson   99*   70.21      15-11-1985  Sharjah CA Stadium                    Pakistan
West Indies   Ramnaresh R Sarwan     99*   95.19      15-11-2002  Sardar Patel Stadium                  India
Zimbabwe      Andrew Flower          99*   89.18      24-10-1999  Harare Sports Club                    Australia
Zimbabwe      Alistair D R Campbell  99*   79.83      01-10-2000  Queens Sports Club                    New Zealand
Zimbabwe      Malcolm N Waller       99*   133.78     25-10-2011  Queens Sports Club                    New Zealand
Australia     David C Boon           98*   82.35      08-12-1994  Bellerive Oval                        Zimbabwe
Australia     Graeme M Wood          98*   63.22      11-01-1981  Melbourne Cricket Ground              India
England       Ian J L Trott          98*   84.48      20-10-2011  Punjab Cricket Association Stadium    India
India         Yuvraj Singh           98*   89.09      01-08-2001  Sinhalese Sports Club Ground          Sri Lanka
Ireland       Kevin J O'Brien        98*   94.23      10-07-2010  VRA Ground                            Scotland
Kenya         Collins O Obuya        98*   75.96      13-03-2011  M.Chinnaswamy Stadium                 Australia
Netherlands   Ryan N ten Doeschate   98*   73.68      01-09-2009  VRA Ground                            Afghanistan
New Zealand   James E C Franklin     98*   142.02     07-12-2010  M.Chinnaswamy Stadium                 India
Pakistan      Ijaz Ahmed             98*   112.64     28-10-1994  Iqbal Stadium                         South Africa
South Africa  Jacques H Kallis       98*   74.24      06-02-2000  St George's Park                      Zimbabwe

37

Against which countries are higher averages scored?

Which countries’ players score more per match?

38

Which player scores the most per ball?

The player with the highest strike rate is an obscure South African whose name most of us have never heard of.

In fact, this list is filled with players we have never heard of.

39

Most analysis answers the question “Which are the top 10 X?”

Which are my top products?

Which are my top branches?

Who are my best sales people?

Which vendors have the highest cost per unit?

Which divisions are spending the most money?

In which hours does the under-12 segment watch TV most?

Which customer segment has the highest revenue per user?

40

THIS QUESTION CAN BE ANSWERED SYSTEMATICALLY


Take every column in the data

Find the top value by that column

Country: South Africa has the highest strike rate of 76%
Player: Johann Louw has the highest strike rate of 329%
Runs: 164 runs has the highest strike rate of 156%
MatchDate: 12-03-2006 has the highest strike rate of 136%
Ground: AC-VDCA Stadium has the highest strike rate of 98%
Versus: United States has the highest strike rate of 104%
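The two steps above (take every column, find the top value by that column) can be sketched as a single loop over columns. This version groups rows by each column's values and averages ScoreRate within each group. Averaging per-row rates is a simplification chosen for this sketch (the figures on the slide were presumably computed over the full dataset, possibly in a different way), and `top_value_per_column` is an illustrative name.

```python
from collections import defaultdict

def top_value_per_column(rows, metric, columns):
    """For each column, group rows by value and return the value with the
    highest average `metric`, as {column: (value, average)}."""
    result = {}
    for col in columns:
        groups = defaultdict(list)
        for row in rows:
            groups[row[col]].append(row[metric])
        averages = {value: sum(vals) / len(vals) for value, vals in groups.items()}
        best = max(averages, key=averages.get)
        result[col] = (best, averages[best])
    return result

# Five rows copied from the cricket table in this transcript.
innings = [
    {"Country": "Australia", "Player": "Michael J Clarke", "ScoreRate": 93.39, "Versus": "England"},
    {"Country": "Australia", "Player": "Dean M Jones", "ScoreRate": 128.57, "Versus": "Sri Lanka"},
    {"Country": "India", "Player": "Virender Sehwag", "ScoreRate": 99.0, "Versus": "Sri Lanka"},
    {"Country": "New Zealand", "Player": "James E C Franklin", "ScoreRate": 142.02, "Versus": "India"},
    {"Country": "Zimbabwe", "Player": "Malcolm N Waller", "ScoreRate": 133.78, "Versus": "New Zealand"},
]
top = top_value_per_column(innings, "ScoreRate", ["Country", "Player", "Versus"])
for column, (value, rate) in top.items():
    print(f"{column}: {value} has the highest average score rate of {rate:.2f}")
```

Because the loop does not care which column it is given, the same function answers every "which are the top X" question for a chosen metric.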

41

What do children in schools know, and what can they do, at different stages of elementary education?

Have the inputs made into the elementary education system had a beneficial effect or not?

42

HAVING BOOKS IMPROVES READING ABILITY

Having more books at home improves children’s reading performance. (But children typically have only 1-10 books at home.)

Number of students sampled

What is the impact? How many more marks can having more books fetch?

Circle size indicates number of students with this response. Few students have no books.

Is this response (“25+ books”) good or bad? Small red bars indicate low marks. Large green bars indicate high marks. Students having 25+ books tend to score high marks.

The most common response is marked in blue. This is also the largest circle.

The graphic is summarized in words

Indicates whether the best response is the most popular. Blue means that it is not. Green means that it is. Red means that the worst level is the most popular response.
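The colour rule described above can be expressed as a small function: find the most popular response, find the responses with the highest and lowest average marks, and compare. This is a sketch under the assumption that "best" and "worst" mean highest and lowest average marks; the `response_flag` name and the sample records are invented for illustration.

```python
from collections import defaultdict

def response_flag(records):
    """Classify a survey question as 'green', 'red' or 'blue'.

    green: the most popular response also has the highest average marks
    red:   the most popular response has the lowest average marks
    blue:  anything else
    """
    marks_by_response = defaultdict(list)
    for response, marks in records:
        marks_by_response[response].append(marks)
    averages = {r: sum(m) / len(m) for r, m in marks_by_response.items()}
    most_popular = max(marks_by_response, key=lambda r: len(marks_by_response[r]))
    best = max(averages, key=averages.get)
    worst = min(averages, key=averages.get)
    if most_popular == best:
        return "green"
    if most_popular == worst:
        return "red"
    return "blue"

# Invented (response, marks) pairs for the "books at home" question.
records = [("1-10 books", 48), ("1-10 books", 52), ("1-10 books", 50),
           ("No books", 40), ("25+ books", 62), ("25+ books", 58)]
print(response_flag(records))
```

In this made-up sample the most popular response ("1-10 books") is neither the best nor the worst, so the flag is blue.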

43

CHILDREN LIKE GAMES, AND THEY’RE GOOD

… but playing daily hurts reading ability

44

WATCHING TV OCCASIONALLY IS GOOD

Children who watch TV every day don’t do as well as children who watch TV only once a week.

But children who never watch TV fare the worst.

Watching TV every day helps improve children’s reading ability a little bit more…

… but mathematical abilities fall dramatically at that point

45

WE HAVE A WEBSITE THAT YOU CAN EXPLORE

GRAMENER.COM/NAS

46

AUTOMATING DATA EXPLORATION: A structured approach to analysing data

METADATA

UNIVARIATE ANALYSIS

BIVARIATE ANALYSIS