Talk Big Data Conference Munich - Data Science needs real Data Scientists.


description

How to hire a real Data Scientist? Data Science and Big Data are hyped, and it has become very sexy to be a Data Scientist. More and more self-appointed Data Scientists appear on the market; to be sure you get a real one, you have to test them.

Transcript of Talk Big Data Conference Munich - Data Science needs real Data Scientists.

Page 1: Talk Big Data Conference Munich - Data Science needs real Data Scientists.

Laboratory for Web Science
Dept. Informatics
University of Applied Sciences of Switzerland (FFHS)

http://lwsffhs.wordpress.com
http://lws.ffhs.ch

Follow @blattnerma

Dr. Marcel Blattner


Page 2:

What is a real Data Scientist?

Page 3:

“Knowing the name of something does not mean knowing something.” ― Richard P. Feynman

Page 4:

business relevant questions
fetch’n store data
data normalization
feature engineering / modelling
testing / model assessment
knowledge generation / visualisation

Data Scientist

data driven tasks – what Data Scientists should do

Page 5:

number crunching

[same task list and caption as Page 4]

Page 6:

number crunching

human interpretation

[same task list and caption as Page 4]

Page 7:

skill-cloud and a data-hero

Page 8:

skill-cloud and a data-hero

[skill cloud with “DH” marking the data-hero]

Page 9:

skill-cloud and a data-hero

[skill cloud with “DH” marking the data-hero]

This guy lives in the land of Oz.

Page 10:

the birth of a self-appointed data-hero

Page 11:

…do not hire a self-appointed data-hero…

self-appointed data-hero’s recommender engine

Page 12:

…do not hire a self-appointed data-hero…

You did like this

self-appointed data-hero’s recommender engine

Page 13:

…do not hire a self-appointed data-hero…

You did like this

…then you might like that one as well

self-appointed data-hero’s recommender engine

Page 14:

data science is interdisciplinary…you need a team

HAC = hacking/tech, AN = analytics/math/stats, ST/KO = strategic/communicator

TEAM

Page 15:

data science is interdisciplinary…you need a team

HAC = hacking/tech, AN = analytics/math/stats, ST/KO = strategic/communicator

TEAM

common language is key

Page 16:

data scientists should be scientists

objectivity
falsifiability
reproducibility

Methodology

Page 17:

data scientist – test it!

Challenge the candidate

•  Real business data
•  Kaggle competition
•  Artificially generated data (see the sketch below)
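
A hedged Python sketch of the third option: generate artificial data with a planted effect, then check whether the candidate recovers it. Everything here (feature count, coefficients, file names) is an illustrative assumption, not material from the talk.

    import numpy as np

    rng = np.random.default_rng(7)
    n = 2000
    x = rng.normal(size=(n, 5))

    # Plant a known signal: only features 0 and 3 drive the outcome.
    logits = 1.5 * x[:, 0] - 2.0 * x[:, 3]
    y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

    # A real data scientist should recover features 0 and 3 as the drivers
    # and report honest out-of-sample performance, not training accuracy.
    np.save("challenge_X.npy", x)  # hypothetical file names
    np.save("challenge_y.npy", y)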

Page 18:

data scientist – test it!

Yule-Simpson effect
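
The Yule-Simpson effect: a trend that holds in every subgroup can reverse when the subgroups are pooled. A minimal Python illustration (added by the editor, not part of the slides), using the classic kidney-stone counts commonly quoted for this paradox:

    import pandas as pd

    # Classic kidney-stone counts, commonly used to illustrate the paradox.
    df = pd.DataFrame({
        "treatment":  ["A", "A", "B", "B"],
        "stone_size": ["small", "large", "small", "large"],
        "successes":  [81, 192, 234, 55],
        "trials":     [87, 263, 270, 80],
    })

    # Within each subgroup, treatment A has the higher success rate.
    df["rate"] = df["successes"] / df["trials"]
    print(df)

    # Pooled over subgroups, treatment B looks better, because subgroup
    # sizes are unbalanced: stone size confounds the comparison.
    overall = df.groupby("treatment")[["successes", "trials"]].sum()
    print(overall["successes"] / overall["trials"])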

Page 19:

data scientist – test it!

curse of dimensionality
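
The curse of dimensionality can be demonstrated in a few lines: as the dimension grows, distances from a query point to random points concentrate, so the nearest and farthest neighbours become nearly indistinguishable. A minimal sketch (an editor's illustration, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500  # points per dimensionality

    for d in [2, 10, 100, 1000]:
        x = rng.uniform(size=(n, d))          # random points in the unit cube
        q = rng.uniform(size=d)               # one random query point
        dist = np.linalg.norm(x - q, axis=1)  # Euclidean distances to the query
        # Relative contrast between farthest and nearest point; it shrinks
        # towards zero as d grows, so "nearest" loses its meaning.
        contrast = (dist.max() - dist.min()) / dist.min()
        print(f"d={d:5d}  relative contrast={contrast:.3f}")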

Page 20:

data scientist – test it!

bias-variance tradeoff

4. DATA ALONE IS NOT ENOUGH

Generalization being the goal has another major consequence: data alone is not enough, no matter how much of it you have. Consider learning a Boolean function of (say) 100 variables from a million examples. There are 2^100 − 10^6 examples whose classes you don’t know. How do you figure out what those classes are? In the absence of further information, there is just no way to do this that beats flipping a coin. This observation was first made (in somewhat different form) by the philosopher David Hume over 200 years ago, but even today many mistakes in machine learning stem from failing to appreciate it. Every learner must embody some knowledge or assumptions beyond the data it’s given in order to generalize beyond it. This was formalized by Wolpert in his famous “no free lunch” theorems, according to which no learner can beat random guessing over all possible functions to be learned [25].

This seems like rather depressing news. How then can we ever hope to learn anything? Luckily, the functions we want to learn in the real world are not drawn uniformly from the set of all mathematically possible functions! In fact, very general assumptions (like smoothness, similar examples having similar classes, limited dependences, or limited complexity) are often enough to do very well, and this is a large part of why machine learning has been so successful. Like deduction, induction (what learners do) is a knowledge lever: it turns a small amount of input knowledge into a large amount of output knowledge. Induction is a vastly more powerful lever than deduction, requiring much less input knowledge to produce useful results, but it still needs more than zero input knowledge to work. And, as with any lever, the more we put in, the more we can get out.

A corollary of this is that one of the key criteria for choosing a representation is which kinds of knowledge are easily expressed in it. For example, if we have a lot of knowledge about what makes examples similar in our domain, instance-based methods may be a good choice. If we have knowledge about probabilistic dependencies, graphical models are a good fit. And if we have knowledge about what kinds of preconditions are required by each class, “IF . . . THEN . . .” rules may be the best option. The most useful learners in this regard are those that don’t just have assumptions hard-wired into them, but allow us to state them explicitly, vary them widely, and incorporate them automatically into the learning (e.g., using first-order logic [21] or grammars [6]).

In retrospect, the need for knowledge in learning should not be surprising. Machine learning is not magic; it can’t get something from nothing. What it does is get more from less. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.

5. OVERFITTING HAS MANY FACES

What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.

[Figure 1: Bias and variance in dart-throwing; panels combine low/high bias with low/high variance.]

Everyone in machine learning knows about overfitting, but it comes in many forms that are not immediately obvious. One way to understand overfitting is by decomposing generalization error into bias and variance [9]. Bias is a learner’s tendency to consistently learn the same wrong thing. Variance is the tendency to learn random things irrespective of the real signal. Figure 1 illustrates this by an analogy with throwing darts at a board. A linear learner has high bias, because when the frontier between two classes is not a hyperplane the learner is unable to induce it. Decision trees don’t have this problem because they can represent any Boolean function, but on the other hand they can suffer from high variance: decision trees learned on different training sets generated by the same phenomenon are often very different, when in fact they should be the same. Similar reasoning applies to the choice of optimization method: beam search has lower bias than greedy search, but higher variance, because it tries more hypotheses. Thus, contrary to intuition, a more powerful learner is not necessarily better than a less powerful one.

Figure 2 illustrates this.¹ Even though the true classifier is a set of rules, with up to 1000 examples naive Bayes is more accurate than a rule learner. This happens despite naive Bayes’s false assumption that the frontier is linear! Situations like this are common in machine learning: strong false assumptions can be better than weak true ones, because a learner with the latter needs more data to avoid overfitting.

¹ Training examples consist of 64 Boolean features and a Boolean class computed from them according to a set of “IF . . . THEN . . .” rules. The curves are the average of 100 runs with different randomly generated sets of rules. Error bars are two standard deviations. See Domingos and Pazzani [10] for details.
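
To connect the excerpt back to the interview test: a candidate should be able to show the bias-variance tradeoff numerically, not just name it. A minimal Python sketch (an editor's illustration under assumed settings, not from the talk or the paper): polynomials of increasing degree are fit to noisy samples of a smooth function; the low degree underfits (high bias) and the high degree overfits (high variance).

    import numpy as np

    rng = np.random.default_rng(1)

    def sample(n):
        # Noisy observations of a smooth target function.
        x = rng.uniform(-1, 1, n)
        return x, np.sin(3 * x) + rng.normal(scale=0.3, size=n)

    x_tr, y_tr = sample(30)   # small training set
    x_te, y_te = sample(200)  # held-out test set

    for degree in [1, 4, 12]:
        coef = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
        mse = lambda x, y: float(np.mean((np.polyval(coef, x) - y) ** 2))
        # Degree 1 underfits (high bias); degree 12 overfits (high variance):
        # training error keeps falling while test error rises again.
        print(f"degree={degree:2d}  train={mse(x_tr, y_tr):.3f}  test={mse(x_te, y_te):.3f}")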

Page 21:

data scientist – test it!

the curse of big data

correlations by chance
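
Correlations by chance are easy to reproduce: with many variables and few observations, some pairs of pure noise correlate strongly. A minimal sketch (an editor's illustration, not from the slides):

    import numpy as np

    rng = np.random.default_rng(42)
    n_obs, n_vars = 50, 1000
    x = rng.normal(size=(n_obs, n_vars))  # pure noise: no real relationships

    corr = np.corrcoef(x, rowvar=False)   # all pairwise correlation coefficients
    np.fill_diagonal(corr, 0.0)           # ignore trivial self-correlations
    # Among ~500,000 noise pairs the largest |r| is typically well above 0.5,
    # even though every variable is independent of every other.
    print(f"largest |r| among noise pairs: {np.abs(corr).max():.2f}")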

Page 22:

Summary

•  Don’t hire a self-appointed data-hero
•  Build a data science team
•  Challenge potential candidates
•  Be skeptical