Talk Big Data Conference Munich - Data Science needs real Data Scientists.


description

How to hire a real Data Scientist? Data Science and Big Data are hyped, and it has become very sexy to be a Data Scientist. More and more self-appointed Data Scientists appear on the market; to be sure you get a real one, you have to test them.

Transcript of Talk Big Data Conference Munich - Data Science needs real Data Scientists.

Page 1: Talk Big Data Conference Munich - Data Science needs real Data Scientists.

Laboratory for Web Science
Dept. Informatics
University of Applied Sciences of Switzerland (FFHS)

http://lwsffhs.wordpress.com
http://lws.ffhs.ch

Follow @blattnerma

Dr. Marcel Blattner


Page 2:

What is a real Data Scientist?

Page 3:

“Knowing the name of something does not mean knowing something.” ― Richard P. Feynman

Page 4:

business relevant questions
fetch’n store data
data normalization
feature engineering / modelling
testing / model assessment
knowledge generation / visualisation

Data Scientist

data driven tasks – what Data Scientists should do

Page 5:

number crunching

[same task list and caption as Page 4]

Page 6:

number crunching

human interpretation

[same task list and caption as Page 4]

Page 7:

skill-cloud and a data-hero

Page 8:

skill-cloud and a data-hero

[skill cloud with “DH” marking the data-hero]

Page 9:

skill-cloud and a data-hero

[skill cloud with “DH” marking the data-hero]

This guy lives in the land of Oz.

Page 10:

the birth of a self-appointed data-hero

Page 11:

…do not hire a self-appointed data-hero…

self-appointed data-hero’s recommender engine

Page 12:

…do not hire a self-appointed data-hero…

You did like this

self-appointed data-hero’s recommender engine

Page 13:

…do not hire a self-appointed data-hero…

You did like this

…then you might like that one as well

self-appointed data-hero’s recommender engine

Page 14:

data science is interdisciplinary…you need a team

HAC = hacking/tech, AN = analytics/math/stats, ST/KO = strategic/communicator

TEAM

Page 15:

data science is interdisciplinary…you need a team

HAC = hacking/tech, AN = analytics/math/stats, ST/KO = strategic/communicator

TEAM

common language is key

Page 16:

data scientists should be scientists

objectivity
falsifiability
reproducibility

Methodology

Page 17:

data scientist – test it!

Challenge the candidate

•  Real business data
•  Kaggle competition
•  Artificially generated data (see the sketch below)
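
A hedged Python sketch of the third option: generate artificial data with a planted effect, then check whether the candidate recovers it. Everything here (feature count, coefficients, file names) is an illustrative assumption, not material from the talk.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 2000
    x = rng.normal(size=(n, 5))

    # Plant a known signal: only features 0 and 3 drive the outcome.
    logits = 1.5 * x[:, 0] - 2.0 * x[:, 3]
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

    # A real data scientist should recover features 0 and 3 as the drivers
    # and report honest out-of-sample performance, not training accuracy.
    np.save("challenge_X.npy", x)  # hypothetical file names
    np.save("challenge_y.npy", y)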

Page 18:

data scientist – test it!

Yule-Simpson effect
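
The Yule-Simpson effect: a trend that holds in every subgroup can reverse when the subgroups are pooled. A minimal Python illustration (added by the editor, not part of the slides), using the classic kidney-stone counts commonly quoted for this paradox:

    import pandas as pd

    # Classic kidney-stone counts, commonly used to illustrate the paradox.
    df = pd.DataFrame({
        "treatment":  ["A", "A", "B", "B"],
        "stone_size": ["small", "large", "small", "large"],
        "successes":  [81, 192, 234, 55],
        "trials":     [87, 263, 270, 80],
    })

    # Within each subgroup, treatment A has the higher success rate.
    df["rate"] = df["successes"] / df["trials"]
    print(df)

    # Pooled over subgroups, treatment B looks better, because subgroup
    # sizes are unbalanced: stone size confounds the comparison.
    overall = df.groupby("treatment")[["successes", "trials"]].sum()
    print(overall["successes"] / overall["trials"])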

Page 19:

data scientist – test it!

curse of dimensionality
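
The curse of dimensionality can be demonstrated in a few lines: as the dimension grows, distances from a query point to random points concentrate, so the nearest and farthest neighbours become nearly indistinguishable. A minimal sketch (an editor's illustration, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500  # points per dimensionality

    for d in [2, 10, 100, 1000]:
        x = rng.uniform(size=(n, d))          # random points in the unit cube
        q = rng.uniform(size=d)               # one random query point
        dist = np.linalg.norm(x - q, axis=1)  # Euclidean distances to the query
        # Relative contrast between farthest and nearest point; it shrinks
        # towards zero as d grows, so "nearest" loses its meaning.
        contrast = (dist.max() - dist.min()) / dist.min()
        print(f"d={d:5d}  relative contrast={contrast:.3f}")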

Page 20:

data scientist – test it!

bias-variance tradeoff

4. DATA ALONE IS NOT ENOUGH

Generalization being the goal has another major consequence: data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2^100 − 10^6 examples whose classes you don’t know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it’s given in order to generalize beyond it. This was formalized by Wolpert in his famous “no free lunch” theorems, according to which no learner can beat random guessing over all possible functions to be learned [25].

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions (like smoothness, similar examples having similar classes, limited dependences, or limited complexity) are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more we put in, the more we can get out.

A corollary of this is that one of the key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have a lot of knowledge about what makes examples similar in our domain, instance-based methods may be a good choice. If we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, “IF . . . THEN . . .” rules may be the best option. The most useful learners in this regard are those that don’t just have assumptions hard-wired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning (e.g., using first-order logic [21] or grammars [6]).

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it can’t get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.

5. OVERFITTING HAS MANY FACES

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

[Figure 1: Bias and variance in dart-throwing; panels combine low/high bias with low/high variance.]

Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance [9]. Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees don’t have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be the same. Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.

Figure 2 illustrates this.¹ Even though the true classifier is a set of rules, with up to 1000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes’s false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

¹ Training examples consist of 64 Boolean features and a Boolean class computed from them according to a set of “IF . . . THEN . . .” rules. The curves are the average of 100 runs with different randomly generated sets of rules. Error bars are two standard deviations. See Domingos and Pazzani [10] for details.
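
To connect the excerpt back to the interview test: a candidate should be able to show the bias-variance tradeoff numerically, not just name it. A minimal Python sketch (an editor's illustration under assumed settings, not from the talk or the paper): polynomials of increasing degree are fit to noisy samples of a smooth function; the low degree underfits (high bias) and the high degree overfits (high variance).

    import numpy as np

    rng = np.random.default_rng(1)

    def sample(n):
        # Noisy observations of a smooth target function.
        x = rng.uniform(-1, 1, n)
        return x, np.sin(3 * x) + rng.normal(scale=0.3, size=n)

    x_tr, y_tr = sample(30)   # small training set
    x_te, y_te = sample(200)  # held-out test set

    for degree in [1, 4, 12]:
        coef = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
        mse = lambda x, y: float(np.mean((np.polyval(coef, x) - y) ** 2))
        # Degree 1 underfits (high bias); degree 12 overfits (high variance):
        # training error keeps falling while test error rises again.
        print(f"degree={degree:2d}  train={mse(x_tr, y_tr):.3f}  test={mse(x_te, y_te):.3f}")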

Page 21:

data scientist – test it!

the curse of big data

correlations by chance
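
Correlations by chance are easy to reproduce: with many variables and few observations, some pairs of pure noise correlate strongly. A minimal sketch (an editor's illustration, not from the slides):

    import numpy as np

    rng = np.random.default_rng(42)
    n_obs, n_vars = 50, 1000
    x = rng.normal(size=(n_obs, n_vars))  # pure noise: no real relationships

    corr = np.corrcoef(x, rowvar=False)   # all pairwise correlation coefficients
    np.fill_diagonal(corr, 0.0)           # ignore trivial self-correlations
    # Among ~500,000 noise pairs the largest |r| is typically well above 0.5,
    # even though every variable is independent of every other.
    print(f"largest |r| among noise pairs: {np.abs(corr).max():.2f}")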

Page 22:

Summary

•  Don’t hire a self-appointed data-hero
•  Build a data science team
•  Challenge potential candidates
•  Be skeptical