Data Science at Berkeley

Keynote talk at PyData Silicon Valley 2014 (Facebook Headquarters) on May 4, 2014, by Prof. Joshua Bloom.

Transcript of Data Science at Berkeley

  • Data Science at Berkeley. Joshua Bloom, UC Berkeley, Astronomy. @profjsb. PyData, May 4, 2014
  • The First Rule of Data Science...
  • Data Science as a Discipline (diagram from Drew Conway)
  • What are the Core Principles? Is it an academic pursuit to be taught or a skillset to be trained? When should they be taught? At what level of depth/breadth? Who should teach them and who should know data science? Where should investments be made? Does Data Science need an intellectual home within institutions?
  • "I love working with astronomers, since their data is worthless." - Jim Gray, Microsoft
  • Data Inference Space: Bayesian / Frequentist; Theory/Hypothesis Driven / Data Driven; non-parametric / parametric. Hardware: laptops, clusters/supercomputers. Software: Python/SciPy, R, ... Carbonware: (astro) grad students, postdocs
  • Bayesian Distance Ladder. Some variable stars show a period-brightness correlation: the Leavitt Law (Henrietta Swan Leavitt).
    $m_{ij} = \mu_i + M_{0j} + \alpha_j \log_{10}(P_i/P_0) + E(B-V)_i \, [R_V \, a(1/\lambda_i) + b(1/\lambda_i)] + \epsilon_{ij}$
    i indexes over individual stars; j indexes over wavebands; a and b are fixed constants at each color.
    Data: 134 RR Lyrae (ultraviolet to infrared)
    Fit: 307-dimensional model parameter inference
      - deterministic MCMC model with PyMC
      - ~6 days for a single run (one core)
      - parallelism for convergence tests
    Klein+12; Klein, JSB+14
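The forward model on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' PyMC model: the function name, the default values of `R_V` and `P0`, and every number in the example call are assumptions made up for demonstration. The dust-law terms a and b are simply passed in as per-band constants, as the slide describes them.

```python
import math


def predicted_magnitude(mu_i, M0_j, alpha_j, period_i, ebv_i, a_j, b_j,
                        R_V=3.1, P0=0.5):
    """Noise-free apparent magnitude of star i in waveband j.

    mu_i     : distance modulus of the star
    M0_j     : absolute magnitude zero point of the band at period P0
    alpha_j  : slope of the period-magnitude relation in the band
    period_i : pulsation period of the star (same units as P0)
    ebv_i    : colour excess E(B-V) along the line of sight
    a_j, b_j : dust-law constants for the band (fixed, not fitted)
    R_V, P0  : assumed defaults, chosen here only for illustration
    """
    period_term = alpha_j * math.log10(period_i / P0)
    dust_term = ebv_i * (R_V * a_j + b_j)
    return mu_i + M0_j + period_term + dust_term


# Hypothetical star observed in a hypothetical band.
m = predicted_magnitude(mu_i=18.5, M0_j=0.5, alpha_j=-2.0,
                        period_i=0.5, ebv_i=0.1, a_j=1.0, b_j=0.0)
print(round(m, 3))
```

In a Bayesian fit, each observed magnitude would be compared with this prediction through the scatter term, with the distance moduli, zero points, slopes and colour excesses treated as the unknown parameters.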
  • Bayesian Distance Ladder: sub-1% distance uncertainty; precision 3D dust in the Milky Way. [Figure 6. Multi-band period-luminosity relations. RRab stars are in blue, RRc stars in red. Blazhko-affected stars are denoted with [...]. Axes: brightness, period.] [Milky Way projection: distance, dust.] Klein+12; Klein, JSB+14
  • Morphology of LMC RR Lyrae stars (Klein..JSB+14): now with 15,040 stars... scipy.sparse, hdf5. Dust Map (basemap). 3D density projection (mayavi2). 4× improvement in distance error.
    Page excerpt shown on the slide, clipped at the edges: [...] the 62 individual science CCDs were pro[...]andard reduction algorithms (bias subtrac[...], etc.) using the computational resources at [...] Energy Research Scientific Computing Center. [...] on individual frames was calibrated with [...]ence sources from the Two Micron All Sky S[...]; Skrutskie et al. 2006) using the astrom[...] package (Lang et al. 2010). Photometric [...] performed using same-night observations of Sloan Digital Sky Survey Stripe 82 Standard [...] Ivezic et al. 2007). Applying standard cali[...]dology (e.g., Ofek et al. 2012), we find that [...] a robust scatter in our absolute photometric [...] ≲ 0.02 mag on clear nights (≳ 50% of the ob[...] from the Science Verification run). While this [...] be improved with more advanced modeling [...] signatures (e.g., Tucker et al. 2014), we find [...] is sufficient for our scientific objectives. [...] observing program produced on average [...]
    Figure 3. V-, I-, and z-band period-magnitude relations (solid lines) derived for the LMC RR Lyrae population, superimposed on scatter plots of the RR Lyrae posteriors (M computed using μ_Post). The dashed lines denote the 1σ prediction intervals for a new RR Lyrae star with known period.
    [...] used for all of the μ_i. This standard deviation was selected to be a fractional distance error of 10 per cent (~5 kpc), which is much larger than the depth of the LMC and significantly larger than (> 2 times) the median posterior [...]_i. To fit the model given by equation 6, ten identical MCMC traces were run, each generating 3.5 million iterations. The first 0.5 million were discarded as burn-in and the remaining 3 million were thinned by 300 to result in ten traces of 10,000 iterations each. The Gelman-Rubin conver[...]
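The burn-in and thinning arithmetic quoted in the excerpt, and the kind of Gelman-Rubin check it goes on to mention, can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions: the helper names are hypothetical, and `gelman_rubin` implements the textbook potential scale reduction factor, which may differ in detail from what the paper computed.

```python
import math
import statistics


def burn_and_thin(trace, burn_in, thin):
    """Drop the first `burn_in` samples, then keep every `thin`-th one."""
    return trace[burn_in::thin]


def gelman_rubin(chains):
    """Textbook potential scale reduction factor for one scalar parameter.

    `chains` is a list of equal-length sequences, one per MCMC trace.
    """
    n = len(chains[0])
    chain_means = [statistics.mean(chain) for chain in chains]
    # W: average within-chain variance; B: between-chain variance.
    within = statistics.mean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


# Sample indices stand in for one 3.5-million-iteration trace.
kept = burn_and_thin(range(3_500_000), burn_in=500_000, thin=300)
print(len(kept))  # 10000
```

Values of the statistic close to 1 suggest the traces are sampling the same distribution; values well above 1 indicate that the chains have not yet mixed.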
  • Astronomical Data Deluge: Serious Challenge to Traditional Approaches & Toolkits
    Large Synoptic Survey Telescope (LSST), 2020
      - Light curves for 800M sources every 3 days
      - 10^6 supernovae/yr, 10^5 eclipsing binaries
      - 3.2 gigapixel camera, 20 TB/night
    LOFAR & SKA
      - 150 Gps (27 Tflops) → 20 Pps (~100 Pflops)
    Gaia space astrometry mission, 2014
      - 1 billion stars observed 70 times over 5 years
      - Will observe 20K supernovae
    Many other astronomical surveys are already producing data: SDSS, iPTF, CRTS, Pan-STARRS, Hipparcos, OGLE, ASAS, Kepler, LINEAR, DES, etc.
  • Towards a Fully Automated Scientific Stack for Transients: strategy, scheduling, observing, reduction, finding, discovery, classification, followup, inference. Diagram labels: current state of the art stack; automated; not (yet) automated; published work; NSF/CDI; NSF/BIGDATA
  • Our ML framework found the Nearest Supernova in 3 Decades... Built & deployed robust, real-time machine learning framework, discovering >10,000 events in >10 TB of imaging; 50+ journal articles. Built probabilistic event classification catalogs with innovative active learning. http:// https://
  • What is the toolbox of the modern (data-driven) scientist? domain training statistics advanced computing database GUI parallel visualization Bayesian machine learning Physics laboratory techniques MCMC MapReduce
  • And...How do we teach this with what little time the students have? What is the toolbox of the modern (data-driven) scientist?
  • Data-Centric Coursework, Bootcamps, Seminars, & Lecture Series BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...] parallel programming bootcamp ...and entire degree programs
  • 2010: 85 campers 2012a: 135 campers Python Bootcamps at Berkeley
  • Python: a modern superglue computing language for (data) science. High-level scripting language; open source, huge & growing community in academia & industry; just-in-time compilation but also fast numerical computation; extensive interfaces to 3rd party frameworks
  • A reasonable lingua franca for scientists...
  • 2012b: 210 campers Python Bootcamps at Berkeley 2013a: 253 campers
  • 3 days of live/archive streamed lectures all open material in GitHub widely disseminated (e.g., @ NASA)
  • Part of the Designated Emphasis in Computational Science & Engineering at Berkeley visualization machine learning database interaction user interface & web frameworks timeseries & numerical computing interfacing to other languages Bayesian inference & MCMC hardware control parallelism
  • "Are we alone in the universe? What makes up the missing mass of the universe? ... And maybe the biggest question of all: How in the wide world can you add $3 billion in market capitalization simply by adding .com to the end of a name?" - President William Jefferson Clinton, Science and Technology Policy Address, 21 January 2000. "Add Data Science or Big Data to your course name to increase enrollment by tenfold." - Joshua Bloom, Just Now
  • Python for Data Science @ Berkeley [Sept 2013]
  • [Pie charts: 64% / 36% (female / male); departments (shares 8%, 4%, 8%, 12%, 4%, 12%, 8%, 16%, 16%, 12%): Psychology, Astronomy, Neuroscience, Biostatistics, Physics, Chemical Engineering, ISchool, Earth and Planetary Sciences, Industrial Engineering, Mechanical Engineering.] Parallel Image Reconstruction from Radio Interferometry Data; Graph Theory Analysis of Growing Graphs; Realtime Prediction of Activity Behavior from Smartphone; Bus Arrival Time Prediction in Spain
  • TERRA: optimized for small planets. Erik Petigura, Berkeley Astro Grad Student.
    Time domain preprocessing
      - Start with raw photometry
      - Gaussian process detrending
      - Calibration
      - Petigura & Marcy 2012
    Transit search
      - Matched filter
      - Similar to BLS algorithm (Kovács+ 2002)
      - Leverages Fast-Folding Algorithm, O(N^2) → O(N log N) (Staelin+ 1968)
    Data validation
      - Significant peaks in periodogram, but inconsistent with exoplanet transit
    [Figure: raw flux (ppt) and calibrated flux; detrended/calibrated photometry.]
    Prevalence of Earth-size planets orbiting Sun-like stars. Erik A. Petigura (a, b), Andrew W. Howard (b), and Geoffrey W. Marcy (a). (a) Astronomy Department, University of California, Berkeley, CA 94720; (b) Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822. Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013). Determining whether Earth-like planets are common or rare looms as a touchstone in the question of life in the universe. We searched for Earth-size planets that cross in front of their host st