Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science...

14
Data Mining and Exploration Michael Gutmann [email protected] http://homepages.inf.ed.ac.uk/mgutmann Institute for Adaptive and Neural Computation School of Informatics, University of Edinburgh 19th January 2017 Michael Gutmann DME 1 / 14

Transcript of Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science...

Page 1: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Data Mining and Exploration

Michael Gutmann

[email protected]

http://homepages.inf.ed.ac.uk/mgutmann

Institute for Adaptive and Neural ComputationSchool of Informatics, University of Edinburgh

19th January 2017

Michael Gutmann DME 1 / 14

Page 2: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Data

Oxford dictionary:I Plural of datum

I From Latin: dare, to give; datum: something givenI A piece of information

I Facts [...] collected together for reference or analysis

I Things [...] making the basis of reasoning or calculation

Michael Gutmann DME 2 / 14

Page 3: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

by Frederic Dorr Steele

“Data! Data! Data!” he cried impatiently.

“I can’t make bricks without clay”

Sherlock Holmes

Page 4: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Data sources

I Scientific measurements

I Business records

I Medical tests

I Paying by credit card

I Using the mobile phone

I Social media

I Machines

I . . .

Michael Gutmann DME 4 / 14

Page 5: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Scientific data

Large Hadron Collider:

I Particles collide at high energies, creating new particles thatdecay in complex ways

I The raw data per collision event is around one MB.

I About 600 million events per second.

⇒ 600 terabyte of data per second

Source: https://home.cern/about/computing

Michael Gutmann DME 5 / 14

Page 6: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Human generated data

On a single day

I 500 million tweets

I 4.3 billion Facebook messages

I 6 billion Google searches

I 205 billion emails

I . . .

Source: https://www.gwava.com/blog/internet-data-created-daily

Michael Gutmann DME 6 / 14

Page 7: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Human generated data

Source: https://www.domo.com/blog/data-never-sleeps-4-0

Michael Gutmann DME 7 / 14

Page 8: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Machine generated data

I Airplane engine: 5,000 sensors, 10 GB of data per secondI Internet of Things

Sources: From Machine-To-Machine to the Internet of Things, Ch 2, 2014; aviationweek.com/connected-aerospace/internet-aircraft-things-industry-set-be-transformed

Michael Gutmann DME 8 / 14

Page 9: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Data mining ≈ data analysis ≈ data science

First sentences from corresponding wikipedia pages:

I The overall goal of the data mining process is to extractinformation from a data set and transform it into anunderstandable structure for further use

I Analysis of data is a process of [...] with the goal ofdiscovering useful information, suggesting conclusions, andsupporting decision-making

I Data science [...] is an interdisciplinary field about scientificprocesses and systems to extract knowledge or insights fromdata in various forms [...]

Michael Gutmann DME 9 / 14

Page 10: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Data mining ≈ data analysis ≈ data science

In short:

I Data −→ knowledge

I Evidence −→ conclusions

I Pieces of information −→ actionable information

I The process of “making the bricks out of the clay”

Michael Gutmann DME 10 / 14

Page 11: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Data analysis as statistical inference

Given a data generating process, what are the properties of the outcomes (the data)?

Given the outcomes (the data), what can we say about the process that generated them?

(data source)

Based on Figure 1 of All of statistics by Larry Wasserman

Michael Gutmann DME 11 / 14

Page 12: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Data analysis as statistical inference

Given a data generating process, what are the properties of the outcomes (the data)?

Given the outcomes (the data), what can we say about the process that generated them?

(data source)

Data are a realisation of a random vector x with some probabilitydistribution that we don’t know.

Michael Gutmann DME 12 / 14

Page 13: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Data analysis process

Get (raw) data

Deploy theproduct /

Communicatefindings

Sanity checks

- where are you now?- what do you want to do?- constraints?

- understand the sampling process- any biases?- feeback loops between data analysis and collection?

- become familiar with the data- spot unexpected properties- anomalies, outliers, missing data?

Exploratorydata analysis

- merge data sets, reformat- select/exclude data- provide clear rationale for selection/exclusion

Prep data forfurther analysis

Build andfit model

- generalisation is the goal- choice of evaluation metric- choice of hyperparameters

Summarise, vis-ualise results

- can you tell a simple and coherent story?- what makes sense, what not?- limitations, uncertainties?

Objectives andkey results

Michael Gutmann DME 13 / 14

Page 14: Data Mining and Exploration - School of Informatics · Data mining ˇdata analysis ˇdata science First sentences from corresponding wikipedia pages: I The overall goal of the data

Plan for DME

Get (raw) data

Lectures 1-3

Lecture 4

Lecture 5

Mini-project

Presentations

Deploy theproduct /

Communicatefindings

Sanity checks

- where are you now?- what do you want to do?- constraints?

- understand the sampling process- any biases?- feeback loops between data analysis and collection?

- become familiar with the data- spot unexpected properties- anomalies, outliers, missing data?

Exploratorydata analysis

- merge data sets, reformat- select/exclude data- provide clear rationale for selection/exclusion

Prep data forfurther analysis

Build andfit model

- generalisation is the goal- choice of evaluation metric- choice of hyperparameters

Summarise, vis-ualise results

- can you tell a simple and coherent story?- what makes sense, what not?- limitations, uncertainties?

Objectives andkey results

Michael Gutmann DME 14 / 14