Statistics: the grammar of Data Science

STATISTICS: THE GRAMMAR OF DATA SCIENCE. ÍCARO MEDEIROS. Big Data Week, São Paulo - SP, 23/11/2015

Transcript of Statistics: the grammar of Data Science

Page 1: Statistics: the grammar of Data Science

STATISTICS THE GRAMMAR OF DATA SCIENCE

ÍCARO MEDEIROS

Big Data Week, São Paulo - SP, 23/11/2015

Page 2: Statistics: the grammar of Data Science

WHY PURSUE A SOLID BACKGROUND IN STATISTICS?

https://twitter.com/josh_wills/status/198093512149958656

Page 3: Statistics: the grammar of Data Science

STATISTICS HELPS YOU STAY OUT OF THE DANGER ZONE

http://berkeleysciencereview.com/scientific-collaborations-uc-berkeley-data-driven-cover/

Page 4: Statistics: the grammar of Data Science

INSPIRATIONS FOR THIS KEYNOTE

https://speakerdeck.com/jakevdp/statistics-for-hackers

https://www.goodreads.com/book/show/17986418-naked-statistics

Page 5: Statistics: the grammar of Data Science

METRICS

MEANING

METHODS

MISUSES

-+++

Page 6: Statistics: the grammar of Data Science

HOW TO LIE: MOVIE REVIEWS

http://fivethirtyeight.com/features/fandango-movies-ratings

BECAUSE IT LOOKS LIKE MATH, WE [THINK] IT’S SOMEHOW OBJECTIVELY TRUE, BUT IT’S ALL BASED ON SUBJECTIVE EXPERIENCE

Page 7: Statistics: the grammar of Data Science

FANDANGO LOVES MOVIES

http://fivethirtyeight.com/features/fandango-movies-ratings

(AND SELLS MOVIE TICKETS)

Page 8: Statistics: the grammar of Data Science

ATTENTION TO PROVENANCE

Page 9: Statistics: the grammar of Data Science

HOW TO LIE: ROUNDING

http://fivethirtyeight.com/features/fandango-movies-ratings

Page 10: Statistics: the grammar of Data Science

HOW TO LIE: ROUNDING

http://fivethirtyeight.com/features/fandango-movies-ratings
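
A minimal sketch of how a rounding rule alone can inflate displayed scores, assuming ratings are shown in half-star steps. The function names and the sample ratings are hypothetical and are not taken from the linked article; the sketch only contrasts rounding to the nearest half star with always rounding up.

```python
import math

def round_half_star(rating: float) -> float:
    """Round to the nearest half star (conventional rounding)."""
    return math.floor(rating * 2 + 0.5) / 2

def ceil_half_star(rating: float) -> float:
    """Always round UP to the next half star."""
    return math.ceil(rating * 2) / 2

# Hypothetical average ratings, for illustration only.
ratings = [3.1, 3.6, 4.1, 4.5]

nearest = [round_half_star(r) for r in ratings]
rounded_up = [ceil_half_star(r) for r in ratings]

# Average number of stars gained purely from the choice of rounding rule.
inflation = sum(rounded_up) / len(ratings) - sum(nearest) / len(ratings)

for raw, near, up in zip(ratings, nearest, rounded_up):
    print(f"raw={raw:.1f}  nearest={near:.1f}  rounded up={up:.1f}")
print(f"average inflation: {inflation:.3f} stars")
```

With these made-up inputs, a 4.1 is displayed as 4.5 under the round-up rule but as 4.0 under conventional rounding.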

Page 11: Statistics: the grammar of Data Science

CORRELATION IS NOT CAUSATION

Page 12: Statistics: the grammar of Data Science

SHORT BREAKS AT WORK "CAUSE" CANCER

Example from "Naked Statistics"

Page 13: Statistics: the grammar of Data Science

SMOKING CAUSES CANCER

BREAKS AT WORK: CORRELATED WITH CANCER, LEAD TO SMOKING

Page 14: Statistics: the grammar of Data Science

SPURIOUS CORRELATIONS

http://www.tylervigen.com/spurious-correlations
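
A small sketch of why unrelated series can correlate strongly: if both simply rise over time, the shared trend dominates the Pearson coefficient. The two series below are invented for illustration, and comparing year-over-year changes is one common sanity check that is assumed here, not something the slides prescribe.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def differences(xs):
    """Year-over-year changes, which remove a shared upward trend."""
    return [b - a for a, b in zip(xs, xs[1:])]

# Two made-up yearly series that both happen to grow.
series_a = [10, 12, 13, 15, 18, 19, 21, 24]
series_b = [100, 130, 135, 160, 170, 200, 210, 250]

r_levels = pearson(series_a, series_b)
r_changes = pearson(differences(series_a), differences(series_b))

print(f"correlation of levels:  {r_levels:.2f}")
print(f"correlation of changes: {r_changes:.2f}")
```

The levels correlate almost perfectly, while the yearly changes barely correlate at all, which suggests the trend is doing the work.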

Page 15: Statistics: the grammar of Data Science

A/B TESTING

https://vwo.com/ab-testing/

Page 16: Statistics: the grammar of Data Science

A/B TESTING CAN BE BAD

https://www.quora.com/When-should-A-B-testing-not-be-trusted-to-make-decisions/answer/Edwin-Chen-1

▸Feedback loops

▸Novelty effect

▸Seasonality

▸Wrong metrics

http://www.evanmiller.org/how-not-to-run-an-ab-test.html
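
A minimal sketch of the significance check behind a simple A/B test, assuming a two-sided two-proportion z-test with a sample size fixed in advance. The function name and the visitor and conversion counts are hypothetical. A small p-value here says nothing about feedback loops, novelty, seasonality, or whether conversions were the right metric.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 5000 visitors per variant.
z, p = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=250, n_b=5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Checking this test repeatedly while data is still arriving, and stopping at the first significant result, inflates the false positive rate, so the sample size should be decided before the experiment starts.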

Page 17: Statistics: the grammar of Data Science

CHOOSE THE RIGHT METRICS: CLICKS VS DWELL TIME

http://yahoolabs.tumblr.com/post/99405569711/science-powering-product-and-personalization

Page 18: Statistics: the grammar of Data Science

CHOOSE THE RIGHT METRICS: SHARING IS NOT NECESSARILY CARING

http://time.com/12933/what-you-think-you-know-about-the-web-is-wrong/

Page 19: Statistics: the grammar of Data Science

https://xkcd.com/882 (Significant)

SIGNIFICANCE
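
The comic's point can be sketched numerically: testing 20 hypotheses at the 0.05 level makes at least one false positive more likely than not. The calculation assumes the tests are independent, and the Bonferroni correction is shown as one standard remedy that is assumed here rather than taken from the slides.

```python
alpha = 0.05   # significance threshold per test
n_tests = 20   # number of hypotheses tested, as in the comic

# Probability of at least one false positive when every null hypothesis is true.
family_wise_error = 1 - (1 - alpha) ** n_tests

# Bonferroni correction: divide the threshold by the number of tests.
bonferroni_alpha = alpha / n_tests
corrected_error = 1 - (1 - bonferroni_alpha) ** n_tests

print(f"chance of a false positive, uncorrected: {family_wise_error:.2f}")
print(f"per-test threshold after Bonferroni:     {bonferroni_alpha:.4f}")
print(f"chance of a false positive, corrected:   {corrected_error:.3f}")
```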

Page 20: Statistics: the grammar of Data Science
Page 21: Statistics: the grammar of Data Science
Page 22: Statistics: the grammar of Data Science

THE RED CARD PROBLEM

http://fivethirtyeight.com/features/science-isnt-broken/

http://www.nature.com/news/crowdsourced-research-many-hands-make-tight-work-1.18508

Page 23: Statistics: the grammar of Data Science

61 RESEARCHERS: SAME PROBLEM, DIFFERENT METHODS

http://fivethirtyeight.com/features/science-isnt-broken/

Page 24: Statistics: the grammar of Data Science

THE BACON CONTROVERSY

Page 25: Statistics: the grammar of Data Science

MORE ABOUT P-VALUES

https://twitter.com/Ted_Underwood/status/658983555008040960

Page 26: Statistics: the grammar of Data Science

IS THIS A GOOD CLASSIFICATION?

http://www.wired.com/2015/10/who-does-bacon-cause-cancer-sort-of-but-not-really/

1 CARCINOGENIC

2A PROBABLY

2B POSSIBLY

Page 27: Statistics: the grammar of Data Science

EFFECT SIZE: BACON VS CIGARETTES (SAME CATEGORY)

This is bacon

18%

Cigarette

Page 28: Statistics: the grammar of Data Science

WAIT FOR IT…

Page 29: Statistics: the grammar of Data Science

Cigarette

2500%

http://www.wired.com/2015/10/who-does-bacon-cause-cancer-sort-of-but-not-really/
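
A rough sketch of why the two percentages are not comparable in size, treating each as a relative increase in risk. The 18% and 2500% figures come from the slides; the 1% baseline is purely hypothetical, applied to both only to make the scale visible, and is not a real incidence rate for either cancer.

```python
def risk_multiplier(relative_increase_pct: float) -> float:
    """Convert a relative increase in percent into a risk multiplier."""
    return 1 + relative_increase_pct / 100

bacon_multiplier = risk_multiplier(18)        # figure from the slides
cigarette_multiplier = risk_multiplier(2500)  # figure from the slides

# Hypothetical baseline risk, NOT a real incidence figure.
baseline_risk = 0.01

bacon_risk = baseline_risk * bacon_multiplier
cigarette_risk = baseline_risk * cigarette_multiplier

print(f"bacon:     x{bacon_multiplier:.2f} -> {bacon_risk:.2%}")
print(f"cigarette: x{cigarette_multiplier:.2f} -> {cigarette_risk:.2%}")
```

Being in the same category says the evidence of harm is similarly strong, not that the effect is similarly large.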

Page 30: Statistics: the grammar of Data Science

http://www.theguardian.com/society/2015/oct/26/bacon-ham-sausages-processed-meats-cancer-risk-smoking-says-who

https://catracalivre.com.br/geral/sustentavel/indicacao/muito-alem-do-bacon-agrotoxicos-tambem-podem-causar-cancer/

Page 31: Statistics: the grammar of Data Science

SCHRÖDINGER’S DIET

http://www.vox.com/2015/5/20/8621527/health-tips-reporter

Page 32: Statistics: the grammar of Data Science

http://nerds.airbnb.com/scaling-data-science

DATA SCIENCE IS AN ACT OF INTERPRETING THE CUSTOMER'S VOICE

Page 33: Statistics: the grammar of Data Science

GOOD DATA VISUALIZATION: TIPS FOR SCATTER PLOTS

http://content.visage.co/hs-fs/hub/424038/file-2094950163-pdf/Data_Visualization_101_How_to_Design_Charts_and_Graphs.pdf

Page 34: Statistics: the grammar of Data Science

DAVID MCCANDLESS: INFORMATION IS BEAUTIFUL

http://www.informationisbeautiful.net/visualizations/diversity-in-tech/

Page 35: Statistics: the grammar of Data Science

IT’S EASY TO LIE WITH STATISTICS, BUT IT’S HARD TO TELL THE TRUTH WITHOUT THEM

Andrejs Dunkels, as quoted in "Naked Statistics"

TAKEAWAY MESSAGE

Page 36: Statistics: the grammar of Data Science

WHY PYTHON IS BETTER FOR DATA SCIENCE

MY NEXT TALK

São Paulo Big Data Meetup

São Paulo - SP, 25/11/2015

VivaReal Portal Imobiliário. Rua Bela Cintra, 539 - Consolação

http://www.meetup.com/pt/Sao-Paulo-Big-Data-Meetup

Page 37: Statistics: the grammar of Data Science

slides icaromedeiros.com.br

slideshare.net/icaromedeiros

@icaromedeiros