Big Data Workshop

Bettina Berendt
Department of Computer Science, KU Leuven, Belgium
http://people.cs.kuleuven.be/~bettina.berendt/

St. John's International School, Waterloo, Belgium
April 23rd, 2018

Who am I?

Goals and non-goals

• Goals

▫ Talk about Big Data as a critical data scientist

▫ Against the background of what science is and what “critical” means in this context

▫ Involve you in being critical and constructive

• Non-goals (selection)

▫ Go into depth about privacy and data protection

– although these topics are unavoidable in the Big Data context

Big Data is ...

(from Alexandra Roche and Josefine Droste’s presentation)

Big Data is …

• “the growth in the volume of structured and unstructured data, the speed at which it is created and collected, and the scope of how many data points are collected”

• Potential for personalizing learning

• Inherits bias

• Surveillance

• Ethical dilemmas

• Transparency (pro and con), privacy

(Alexandra Roche & Josefine Droste)

Science and being critical are ...

What is science? (1)

• A systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.

• the word "science" became increasingly associated with what is today known as the scientific method, a structured way to study the natural world.

• Contemporary science is typically subdivided into the natural sciences which study nature in the broadest sense, the social sciences which study people and societies, and the formal sciences like mathematics which study abstract concepts. […] Disciplines which use science like engineering and medicine may also be considered to be applied sciences.

• Science is related to research, and is normally organized by a university, a college, or a research institute.

(Wikipedia: “Science”)

[Figure, 1st part of pic; source per the reference list: Huber, O., Das psychologische Experiment]

What is science? (2)

“Wissenschaft ist, wenn man genauer nachfragt.”

≈ Science happens when you ask again, and ask more precisely.

(author unknown to me)

[Figure, 1st part of pic; source per the reference list: Huber, O., Das psychologische Experiment]

Big Data is ...

… something we usually encounter via statements

Typical Big Data statements (fictitious, but true to style)

① The average Belgian pupil now spends 3 hours a day chatting.

② Pupils who spend more than 3 hours a day chatting “like” Converse sneakers and Dunkin Donuts.

③ People who “like” Converse and Dunkin Donuts are less intelligent.

Typical Big Data statements (4): From Psychometrics Centre 2013 to Cambridge Analytica 2016

Typical Big Data statements (5) (from the CEM Brochure)

• Maximise learning potential

• The CEM IBE computer-adaptive assessment provides an excellent research-informed baseline to help you predict future performance (in IB Diploma examinations for each subject)

• The CEM IBE computer-adaptive assessment measures students on three key cognitive areas which research shows are linked to later academic outcomes: maths, vocabulary, non-verbal

• Once you have students’ final IB Diploma results, you can return this data to us

• The full CEM IBE product includes additional … questionnaires aiming to understand your students’ motivations, interests and aspirations. (questions about views on cultural background, way of life, social status, …)

So how …

• … can we understand such statements scientifically?

• … can we criticise them scientifically?

Big Data is ...

… data

“Data speak for themselves.”

• “With enough data, the numbers speak for themselves.” Anderson, C. (2008).

• “Quantitative data [...] are independent of interpretation; [...] they often demand an interpretation that transcends the quantitative realm.” Moretti, F. (2007), p. 30

Data?

• datum = given

• “data refer to those elements that are taken [abstracted from phenomena]: extracted through observations, computations, experiments, and record keeping”, “selected from nature by the scientist in accordance with his [sic] purpose” (Kitchin, 2014)

Capta!

Impact of measurement methods

Who or what “speaks”?

Who or what “decides”?

Summary: Data cannot speak for themselves

• Data are never given (by nature), but always taken (by a researcher or other data collector)

▫ With conscious or unconscious purposes/agendas

▫ In some context

• Data and analyses of them require interpretation

• Big Data are samples too

• All data have quality issues; in Big Data, we often do not know what these are

• Combining datasets can introduce biases and errors
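The point that Big Data are samples too can be tried out in a small simulation. This is only a sketch under invented assumptions: the population, the distribution of chat times, and the rule that heavy chatters are more likely to end up in the "big" dataset are all hypothetical and not taken from any real study.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: daily chat time (hours) of 100,000 pupils.
population = [random.expovariate(1 / 1.5) for _ in range(100_000)]
true_mean = statistics.mean(population)

# "Big" but biased sample: a pupil is recorded with a probability that
# grows with chat time (assumed selection mechanism, e.g. only active
# users leave traces on the platform).
big_biased = [h for h in population if random.random() < min(1.0, h / 6)]

# Small but random sample of 500 pupils.
small_random = random.sample(population, 500)

biased_mean = statistics.mean(big_biased)
random_mean = statistics.mean(small_random)

print(f"true mean:           {true_mean:.2f} h")
print(f"big biased sample:   {biased_mean:.2f} h (n={len(big_biased)})")
print(f"small random sample: {random_mean:.2f} h (n={len(small_random)})")
```

In this toy setting the large sample is further from the population mean than the small random one: size does not repair a biased selection.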

Parking lot science

Some more examples of data biases and parking lot science

• Facebook likes, real-world likes

• Facebook self-presentation: only the good things ...

• Restrictions on search in Twitter

Research focus on current and recent events?!

• “Trending topics” algorithm in Twitter based on burstiness

Suppression of persistent topics?!

Big Data is ...

… statistics (on steroids)

What should you ask a statistic?

What should you ask this statement?

The average Belgian pupil now spends 3 hours a day chatting.

How to talk back to a statistic (1)

(building on Huff’s final chapter)

1. Who says so?

2. How do they know?

▫ How were data collected and analysed?

▫ In which contexts?

3. Did somebody change the subject?

▫ What are the actual data?

4. Does it make sense?

So …?

1. Who says so?

2. How do they know?

▫ How were data collected and analysed?

3. Did somebody change the subject?

▫ What are the actual data?

4. Does it make sense?

The average Belgian pupil now spends 3 hours a day chatting.

Huff’s questions in more detail

1. Who says so?

▫ What could be their conscious or unconscious biases?

▫ Do they use unqualified words (“average”: mean, median, …?)

▫ Do they use OK names? (“The survey results from scientists from the University of … show …”)

2. How do they know?

▫ Sample size, selection bias?

▫ Correlation size, significance?

▫ Baseline values?

▫ Did external factors change? E.g. frequency of reporting?

3. Did somebody change the subject? / What are the actual data?

▫ Observation or self-report?

▫ Change over time or across data sets in how basic measures are defined

▫ Correlation or causation?

4. Does it make sense?

▫ Be wary of “exact-sounding numbers” (40.13 Euros to eat per week, average family with 3.5 children)

▫ Extrapolation
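Why the unqualified word “average” matters can be seen with a handful of numbers. The ten chat times below are invented for illustration; they are not data from any survey.

```python
import statistics

# Hypothetical daily chat times (hours) of ten pupils; two are heavy users.
chat_hours = [0.5, 0.5, 1.0, 1.0, 1.0, 1.5, 1.5, 2.0, 9.0, 12.0]

mean_hours = statistics.mean(chat_hours)      # arithmetic mean
median_hours = statistics.median(chat_hours)  # value in the middle

print(f"mean:   {mean_hours:.2f} h")
print(f"median: {median_hours:.2f} h")
```

With these made-up values the mean is 3 hours and the median is 1.25 hours, so “the average pupil chats 3 hours a day” would describe only two of the ten pupils.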

Empiricism and apophenia

Empiricism and apophenia: correlation, causation, and instrumentality

Correlation vs. causation

• The current scientific consensus is that the only way to properly demonstrate causation is to do an experiment.

• Many Big Data sets – especially those concerning people – are not experimental data, because they have been collected as observations in the field, in all the diverse contexts in which people operate.

• This means they can only show correlation.
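How observational data can show a correlation without any causal link between the two measured variables can be sketched with a hidden common cause. Everything here is assumed for illustration: the variables (age, chat time, shoe size), the coefficients, and the noise levels are invented.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(0)
n = 5000

# Hidden common cause: age in years (hypothetical pupils aged 10 to 18).
age = [random.uniform(10, 18) for _ in range(n)]

# Both variables depend on age, but not on each other.
chat_hours = [0.3 * a + random.gauss(0, 1) for a in age]
shoe_size = [20 + 1.2 * a + random.gauss(0, 1.5) for a in age]

r_all = pearson(chat_hours, shoe_size)

# Holding the common cause (nearly) fixed: only pupils aged about 14.
band = [(c, s) for a, c, s in zip(age, chat_hours, shoe_size)
        if 13.75 <= a <= 14.25]
r_band = pearson([c for c, _ in band], [s for _, s in band])

print(f"all pupils:      r = {r_all:.2f}")
print(f"pupils aged ~14: r = {r_band:.2f} (n={len(band)})")
```

In the simulated data, chat time and shoe size correlate clearly across all pupils, yet the correlation largely disappears once age is held roughly constant. An experiment would control such factors by design; field data do not.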

How to talk back to a statistic (2)

1. Who says so?

2. How do they know?

▫ How were data collected and analysed?

3. Did somebody change the subject?

▫ What are the actual data?

▫ Correlation or causation?

4. Does it make sense?

“Correlation replaces causation”?!

(1) Good enough for business logic

(2) But deficient for explanation (can we really explain German history like this?)

(3) What about predictions that affect someone’s self-image?

Questions you should ask any inferential statistic (e.g., prediction models)

• How good is the model?

• There are many relevant measures of “goodness”.

• In the following, only a small selection.

What is the measure, and is it statistically significant?

[Figure caption, from Kosinski et al., 2013]

• Prediction accuracy of regression for numeric attributes and traits expressed by the Pearson correlation coefficient between predicted and actual attribute values;

• all correlations are significant at the P < 0.001 level.

• The transparent bars indicate the questionnaire’s baseline accuracy, expressed in terms of test–retest reliability.

But what does the correlation value itself say?

(Wikipedia: “Correlation”)
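Statistical significance and the size of a correlation are two different things. The sketch below uses Fisher's z approximation (a standard textbook approximation, chosen here for simplicity and not necessarily the test used in the cited paper) to compute a two-sided p-value, and r squared as the share of variance two variables have in common in a linear relationship. The (r, n) pairs are invented.

```python
import math

def fisher_z_p_value(r, n):
    """Approximate two-sided p-value for H0 'true correlation is 0'.

    Uses Fisher's z transformation: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3).
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical combinations of correlation size and sample size.
for r, n in [(0.05, 100), (0.05, 10_000), (0.30, 200), (0.60, 50)]:
    p = fisher_z_p_value(r, n)
    print(f"r = {r:.2f}, n = {n:>6}: p = {p:.2g}, r^2 = {r * r:.1%}")
```

With 10,000 observations even r = 0.05 passes the P < 0.001 threshold, although the two variables then share only a quarter of a percent of their variance. "Significant" therefore does not answer how good a prediction is.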

How is a classification model built?

How good is the model? (= How is a classification model evaluated?)

Confusion matrix

How good?

• Overall accuracy = (4+900)/1010 = 89.5%

• Precision for “criminals” = 4/104 = 3.8%

• Recall for “criminals” = 4/10 = 40%

• Accuracy of model “always innocent” = 1000/1010 = 99%
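The four figures above can be recomputed from a confusion matrix. The matrix itself is not preserved in this transcript; the four cell counts below are reconstructed from the slide's arithmetic (4 of 104 predicted “criminals” are correct, 4 of 10 actual criminals are found, 1010 cases in total), so treat them as an inference.

```python
# Confusion matrix cells, reconstructed from the slide's fractions.
tp = 4     # predicted "criminal", actually criminal
fp = 100   # predicted "criminal", actually innocent (104 - 4)
fn = 6     # predicted "innocent", actually criminal (10 - 4)
tn = 900   # predicted "innocent", actually innocent

total = tp + fp + fn + tn

accuracy = (tp + tn) / total      # share of all cases classified correctly
precision = tp / (tp + fp)        # share of predicted criminals who are criminals
recall = tp / (tp + fn)           # share of actual criminals who are found

# Trivial model that labels everybody "innocent": it is right for every
# innocent person and wrong for every criminal.
always_innocent_accuracy = (fp + tn) / total

print(f"accuracy:  {accuracy:.1%}")
print(f"precision: {precision:.1%}")
print(f"recall:    {recall:.1%}")
print(f"'always innocent' accuracy: {always_innocent_accuracy:.1%}")
```

The comparison in the last line is the important one: when the class of interest is rare, a model that never predicts it can beat the "real" model on accuracy, which is why precision and recall need to be reported as well.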

How to talk back to a statistic (3)

1. Who says so?

2. How do they know?

▫ How were data collected and analysed?

▫ How good is the model?

3. Did somebody change the subject?

▫ What are the actual data? Correlation or causation?

4. Does it make sense?

Recap (from the CEM Brochure)

• Maximise learning potential

• The CEM IBE computer-adaptive assessment provides an excellent research-informed baseline to help you predict future performance (in IB Diploma examinations for each subject)

• The CEM IBE computer-adaptive assessment measures students on three key cognitive areas which research shows are linked to later academic outcomes: maths, vocabulary, non-verbal

• Once you have students’ final IB Diploma results, you can return this data to us

• The full CEM IBE product includes additional … questionnaires aiming to understand your students’ motivations, interests and aspirations. (questions about views on cultural background, way of life, social status, …)

How to talk back to a statistic (4)

1. Who says so?

2. How do they know?

▫ How were data collected and analysed?

▫ How good is the model?

3. Did somebody change the subject?

▫ What are the actual data? Correlation or causation?

4. Does it make sense?

5. What is actually being claimed?

Accumulation of errors

[Figure; recoverable labels: “Statistical model 1” (twice) and “… and if they see this ad, they will vote for Trump”]
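How errors accumulate when one statistical model feeds into the next can be sketched with simple arithmetic. The three steps and their accuracies are hypothetical, and the calculation assumes that the steps err independently and that the final claim holds only if every step is right; real pipelines may behave better or worse.

```python
from math import prod

# Hypothetical chain of inferences, each with an assumed accuracy.
step_accuracies = {
    "likes -> personality trait": 0.8,
    "personality trait -> receptive to this ad": 0.8,
    "sees this ad -> changes vote": 0.8,
}

# Under independence, the chance that all steps are right is the product.
chain_accuracy = prod(step_accuracies.values())

for step, acc in step_accuracies.items():
    print(f"{step}: {acc:.0%}")
print(f"whole chain: {chain_accuracy:.1%}")
```

Three steps that are each right four times out of five give a chain that is right only about half of the time under these assumptions.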

Big Data is ...

… business models

Recap (from the CEM Brochure)

• Maximise learning potential

• The CEM IBE computer-adaptive assessment provides an excellent research-informed baseline to help you predict future performance (in IB Diploma examinations for each subject)

• The CEM IBE computer-adaptive assessment measures students on three key cognitive areas which research shows are linked to later academic outcomes: maths, vocabulary, non-verbal

• Once you have students’ final IB Diploma results, you can return this data to us

• The full CEM IBE product includes additional … questionnaires aiming to understand your students’ motivations, interests and aspirations. (questions about views on cultural background, way of life, social status, …)

How to talk back to a statistic (5)

1. Who says so?

▫ What (else) are they interested in?

2. How do they know?

▫ How were data collected and analysed?

▫ How good is the model?

3. Did somebody change the subject?

▫ What are the actual data? Correlation or causation?

4. Does it make sense?

5. What is actually being claimed?

NB: Can I see my data?

What if it’s wrong?

• You have data access rights (and other rights) under European data protection legislation.

• But that’s another workshop …

Big Data is ...

… an understanding of the past used to justify what some decision maker wants to do in the future.

(Geoffrey Rockwell, personal communication, cited from memory)

Which brings us to …

• … the 2nd meaning of “critical” in science

• “Critical theory” (Habermas, Adorno, …)

▫ (social) science as a practical philosophy aiming at societal change with the goal of increasing the autonomy / self-determination of people

▫ (A view of “critical” not as widely shared as the first one)

Here:

• Is data the only answer?

• What is the question?

Let’s be practical philosophers and scientists

… and we’ll use a different example now

Belgium: top?

http://ec.europa.eu/eurostat/tgm/refreshTableAction.do?tab=table&plugin=1&pcode=ten00063&language=en

Belgium: flop?

One reason: Belgians don’t excel at sorting waste

Group work!

• Group 1: You are a fitbit-wristband / smart-home company and want to develop predictive analytics for identifying who will have problems separating their trash properly, in order to give them helpful alerts. You may use any data you want. Prepare a pitch for your business model.

• Group 2: You are a company that wants to use Big Data, but avoid processing personal data. Develop an idea for how to best use these data. Prepare a pitch for your business model.

• Group 3: You are a civil society organisation that wants to improve the trash situation without recourse to Big Data. Prepare a pitch for your idea.

Note 1: Definition of “recycling rate”

Recycling rates for packaging waste (in %)

'Recycling rate' for the purposes of Article 6(1) of Directive 94/62/EC means the total quantity of recycled packaging waste, divided by the total quantity of generated packaging waste.

http://ec.europa.eu/eurostat/web/products-datasets/product?code=ten00063
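The definition above is a simple ratio and can be written down directly. The function name and the example tonnages are made up for illustration; they are not Eurostat figures.

```python
def recycling_rate(recycled_tonnes: float, generated_tonnes: float) -> float:
    """Recycling rate in percent: recycled packaging waste divided by
    generated packaging waste, as in the definition quoted above."""
    if generated_tonnes <= 0:
        raise ValueError("generated packaging waste must be positive")
    return 100.0 * recycled_tonnes / generated_tonnes

# Hypothetical example: 800 t recycled out of 1000 t generated.
rate = recycling_rate(800.0, 1000.0)
print(f"recycling rate: {rate:.1f} %")
```

Note what the ratio leaves out: it covers packaging waste only, and it says nothing about how much waste is generated in the first place. A country can have a high rate and still produce a lot of waste.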

Note 2: Recycling science

Some more ideas

Shops

Re-use

Activists

“Science activists”

Thank you!

Questions? Email me!

http://people.cs.kuleuven.be/~bettina.berendt/

References

• Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired 16.07. Available at http://edge.org/3rd_culture/anderson08/anderson08_index.html

• pp. 42ff: Degeling, M. & Berendt, B. (2017). What is wrong about Robocops as consultants? A technology-centric critique of predictive policing. AI & Society. May 2017 Online First.

• pp. 8, 10: Huber, O. (). Das psychologische Experiment: Eine Einführung.

• Huff, D. (1954). How to Lie with Statistics. New York: W.W. Norton & Company, Inc.

• Kitchin, R. (2014). The Data Revolution. Big Data, Open Data, Data Infrastructures & Their Consequences. London: Sage.

• p. 13, 37, 39: Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences, 110 (15), 5802–5805.

• Moretti, F. (2005). Graphs, Maps, Trees. Abstract Models for Literary History. p.30 London: Verso (cited from the paperback published in 2007)

• pp. 13f, 49: www.theguardian.com/commentisfree/2018/mar/20/brenda-the-civil-disobedience-penguin-on-cambridge-analytica-the-real-was-getting-caught

• pp. 31f.: From http://www.tylervigen.com/spurious-correlations

• Further sources on the slides themselves.

• My apologies for having mislaid some photo/picture URLs, and thanks to those who provide(d) them online!

Not cited, but also potentially interesting:

• Berendt, B. (2015). Big Capta, Bad Science? On two recent books on “Big Data” and its revolutionary potential. http://people.cs.kuleuven.be/~bettina.berendt/Reviews/BigData.pdf

• boyd, d. & Crawford, K. (2012). Critical questions for Big Data. Information, Communication & Society, 15:5, 662-679, DOI: 10.1080/1369118X.2012.678878.