Data Science at Berkeley
Joshua Bloom, UC Berkeley, Astronomy
@profjsb
PyData, May 4, 2014
The First Rule of Data Science...
diagram from Drew Conway
Data Science as a Discipline
diagram from Drew Conway
Data Science as a Discipline
• What are the Core Principles?
• Is it an academic pursuit to be taught or a skillset to be trained?
• When should they be taught? At what level of depth/breadth?
• Who should teach them and who should know data science?
• Where should investments be made? Does Data Science need an intellectual home within institutions?
“I love working with astronomers, since their data is worthless.”
- Jim Gray, Microsoft
Data Inference Space
Bayesian / Frequentist
Theory/Hypothesis Driven / Data Driven
non-parametric / parametric
Hardware - laptops → clusters/supercomputers
Software - Python/Scipy, R, ...
Carbonware - (astro) grad students, postdocs
$$ m_{ij} = \mu_i + M_{0,j} + \alpha_j \log_{10}(P_i/P_0) + E(B-V)_i \times [R_V \times a(1/\lambda_i) + b(1/\lambda_i)] + \epsilon_{ij} $$
Bayesian Distance Ladder
Some Variable Stars show a Period-Brightness Correlation
i indexes over individual stars
j indexes over wavebands
a and b are fixed constants at each color
Data - 134 RR Lyrae (ultraviolet to infrared)
Fit - 307-dimensional model parameter inference
- deterministic MCMC model with PyMC
- ~6 days for a single run (one core)
- parallelism for convergence tests
Klein+12; Klein, JSB+14
“Leavitt Law”
Henrietta Swan Leavitt
C. R. Klein et al.
Figure 6. Multi-band period–luminosity relations. RRab stars are in blue, RRc stars in red. Blazhko-affected stars are denoted with diamonds, stars not known to exhibit the Blazhko effect are denoted with squares. Solid black lines are the best-fitting period–luminosity relations in each waveband and dashed lines indicate the 1σ prediction uncertainty for application of the best-fitting period–luminosity relation to a new star with known period.
© 2013 RAS, MNRAS 000, 1–11
Bayesian Distance Ladder
• Sub-1% distance uncertainty
• Precision 3D dust in the Milky Way
Milky Way projection: distance, dust
Brightness vs. Period
Klein+12; Klein, JSB+14
Morphology of LMC RR Lyrae stars
exposure, 61 of the 62 individual science CCDs² were processed using standard reduction algorithms (bias subtraction, flat-fielding, etc.) using the computational resources at the National Energy Research Scientific Computing Center (NERSC³).
Astrometry on individual frames was calibrated with respect to reference sources from the Two Micron All Sky Survey (2MASS; Skrutskie et al. 2006) using the astrometry.net software package (Lang et al. 2010). Photometric calibration was performed using same-night observations of sources in the Sloan Digital Sky Survey Stripe 82 Standard Star Catalog (Ivezic et al. 2007). Applying standard calibration methodology (e.g., Ofek et al. 2012), we find that we can achieve a robust scatter in our absolute photometric calibration of ≲ 0.02 mag on clear nights (≳ 50% of the observing time from the Science Verification run). While this calibration can be improved with more advanced modeling of instrumental signatures (e.g., Tucker et al. 2014), we find that 2% precision is sufficient for our scientific objectives.
The DECam observing program produced on average one 1-second exposure for each of the 30 frames each night. This sub-optimal exposure depth and cadence was necessitated by the oversubscription of DECam and the desire to test the instrument performance in unusual or extreme modes of operation during the science verification period. Most of the RR Lyrae targets are marginally detected in single exposures, but this was not sufficient to produce the traditional phase-folded light curves that provide mean-flux measurements through harmonic modeling. To recover z-band mean-flux magnitudes the individual epochs of each CCD were flux-scaled using relative photometric zero points measured with PSFex (Bertin 2011) and SExtractor (Bertin & Arnouts 1996), and then average combined with Swarp (Bertin et al. 2002). This procedure resulted in a mean error on the z-band mean-flux magnitude measurements for the final RR Lyrae sample of 0.0387 mag, which includes the errors introduced by absolute photometric calibration and the relative epoch-to-epoch flux-scaling.
3 RR LYRAE DISTANCE MEASUREMENTS
Distances for the individual RR Lyrae stars are measured by using the observed, extinction-corrected V, I, and z magnitudes in combination with the period–magnitude relations. The method employed is similar to the simultaneous Bayesian linear regression methodology described in Klein & Bloom (2014). A significant difference is that in the present analysis the colour excess is considered part of the observed data, not as a prior to which a posterior distribution is fit. The following two subsections detail the derivation of individual RR Lyrae colour excess and provide more description of the specific period–magnitude relations fitting procedure.
3.1 Colour excess
Figure 2. Map of the LMC RR Lyrae stars coloured by colour excess value, E(V − I). Median RR Lyrae colour excess value is 0.228 mag, and the median error per star is 0.093 mag.
² One chip, C61, was not fully operable during the Science Verification run.
³ See http://www.nersc.gov.
The E(V − I) colour excess for each RR Lyrae star is derived from the observed OGLE III mean-flux magnitudes and the previously-calibrated V and I period–magnitude relations published in Klein & Bloom (2014). This approach is conceptually similar to that of Haschke et al. (2011), with the main difference being that the earlier study used the theoretical V-band metallicity–luminosity and I-band period–metallicity–luminosity relations of Catelan et al. (2004). In the present work, colour excess is given by the subtraction of the absolute colour (from the period–magnitude relations) from the observed colour,
$$ E(V - I) = (m_V - m_I) - [M_V(P) - M_I(P)]. \qquad (1) $$
The dominant source of error in the colour excess calculation is the intrinsic scatter of the period–magnitude relations. The median colour excess for the LMC RR Lyrae population is found to be 0.228 mag, with a median error of 0.093 mag. This is significantly greater than the median value of 0.11 mag (with standard deviation of 0.06 mag) found by Haschke et al. (2011). Fig. 2 is a map of the RR Lyrae distribution coloured by colour excess. Two prominent regions of large extinction are apparent, one shaped like a downward-pointing wedge located at a right ascension ≈ 87°, and the other a band running north-south centred at right ascension ≈ 73°. Both of these features are also noted by Haschke et al. (2011) and depicted in their Fig. 10.
The band-specific extinction for each star was derived from the measured colour excess value using the extinction curve data given in Table 6 of Schlegel et al. (1998), in combination with the colour excess conversion factor to transform from E(V − I) to the conventional E(B − V), 1.62, reported in Johnson (1968) and referenced by Schultz & Wiemer (1975) and Rieke & Lebofsky (1985). The corrected mean-flux magnitudes are thus given by
$$ m_V = m_{V,\mathrm{obs}} - 3.240 \times [E(V - I)/1.62] \qquad (2) $$
$$ m_I = m_{I,\mathrm{obs}} - 1.962 \times [E(V - I)/1.62] \qquad (3) $$
$$ m_z = m_{z,\mathrm{obs}} - 1.479 \times [E(V - I)/1.62]. \qquad (4) $$
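Equations 1 to 4 amount to a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the absolute magnitudes M_V(P) and M_I(P) are passed in as plain numbers, because the Klein & Bloom (2014) relations that supply them are not reproduced in this excerpt.

```python
# Extinction coefficients per unit E(B - V), as they appear in equations 2-4.
EXTINCTION_COEFF = {"V": 3.240, "I": 1.962, "z": 1.479}
# Conversion factor from E(V - I) to E(B - V) quoted in the text.
EVI_TO_EBV = 1.62


def colour_excess(m_v, m_i, abs_m_v, abs_m_i):
    """Equation 1: observed colour minus absolute colour.

    abs_m_v and abs_m_i stand in for M_V(P) and M_I(P); how they are
    obtained is outside this sketch (assumption).
    """
    return (m_v - m_i) - (abs_m_v - abs_m_i)


def extinction_corrected(m_obs, e_vi, band):
    """Equations 2-4: subtract the band-specific extinction from m_obs."""
    return m_obs - EXTINCTION_COEFF[band] * (e_vi / EVI_TO_EBV)
```

For instance, `extinction_corrected(m_obs, colour_excess(m_v, m_i, abs_m_v, abs_m_i), "z")` chains the two steps for the z band.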
© 2014 RAS, MNRAS 000, 1–7
Klein et al.
3.2 Period–magnitude relations
The V, I, and z extinction-corrected mean-flux magnitudes were used to calibrate period–magnitude relations through a method similar to the Bayesian simultaneous linear regression formalism employed for 13 simultaneous fits in Klein & Bloom (2014). The primary difference in this application is that the colour excess is not fitted as a model parameter, and is instead incorporated into the likelihood (observed data). The framework easily accommodates the extra model parameters, but the augmented processing time, which goes roughly as O(n²), is unreasonable for fitting a model with 15,040 stars (compared to the calibration sample size of 134 for Klein & Bloom 2014).
Before the Bayesian MCMC fitting procedure was performed, the dataset of 17,629 stars was cleaned to reject outliers. These are most likely foreground stars or stars with poorly measured photometry resulting from crowding effects. All stars with a median absolute deviation in magnitude greater than 5σ for any of the three wavebands were removed, and then a simple least-squares linear regression was performed to fit preliminary period–magnitude relations and all stars more than 4σ from the best fitted line for any waveband’s relation were also removed. 15,040 RR Lyrae stars survived the cuts and made it into the calibration sample.
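The two-stage cleaning described above can be sketched for a single waveband as follows. This is one possible reading, not the authors' code: the text does not spell out how the 5σ threshold is tied to the median absolute deviation, so the sketch assumes a robust sigma of 1.4826 × MAD, and it assumes the 4σ cut uses the standard deviation of the residuals about the preliminary line. The function name `clean_band` is hypothetical.

```python
import numpy as np


def clean_band(log_period, mag, mad_cut=5.0, fit_cut=4.0):
    """Return a boolean mask of stars that survive both cuts in one band."""
    log_period = np.asarray(log_period, dtype=float)
    mag = np.asarray(mag, dtype=float)

    # Stage 1 (assumed form): clip on deviation from the median magnitude,
    # scaled by a MAD-based robust sigma.
    deviation = np.abs(mag - np.median(mag))
    robust_sigma = 1.4826 * np.median(deviation)
    keep = deviation <= mad_cut * robust_sigma

    # Stage 2: least-squares line through the survivors, then clip on residuals.
    slope, intercept = np.polyfit(log_period[keep], mag[keep], 1)
    residual = mag - (slope * log_period + intercept)
    keep &= np.abs(residual) <= fit_cut * np.std(residual[keep])
    return keep
```

To mimic the "any waveband" rule, the per-band masks would be combined with a logical AND across V, I, and z.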
The calibration sample is composed of 11,846 RRab stars (fundamental mode pulsators) and 3,194 RRc stars (first overtone pulsators). The RRc stars’ periods must be “fundamentalised” before deriving the period–magnitude relations. As in Dall’Ora et al. (2004), an RRc star’s fundamentalised period is given by
$$ \log_{10}(P_f) = \log_{10}(P_{fo}) + 0.127. \qquad (5) $$
The general form of the period–magnitude relation is then
$$ m_{ij} = \mu_i + M_{0,j} + \alpha_j \log_{10}(P_i/P_0) + \epsilon_{ij}, \qquad (6) $$
where m_ij is the observed apparent, extinction-corrected mean-flux magnitude of the ith RR Lyrae star in the jth waveband, μ_i is the distance modulus for the ith RR Lyrae star, M_0,j is the absolute magnitude zero point for the jth waveband, α_j is the slope in the jth waveband, P_i is the fundamentalised period of the ith RR Lyrae star in days, P_0 is a period normalisation factor (for consistency with Klein & Bloom 2014 we use P_0 = 0.52854 d), and the ε_ij error terms are independent zero-mean Gaussian random deviates with variance (σ²_intrinsic,j + σ²_mij).
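Equations 5 and 6 translate directly into code. The following is a minimal sketch, with hypothetical function names, of the period fundamentalisation and of the Gaussian log-likelihood for a single star in a single waveband; it illustrates the model only and is not the PyMC implementation used for the actual fit.

```python
import math

P0_DAYS = 0.52854  # period normalisation quoted in the text


def fundamentalise(p_fo_days):
    """Equation 5: shift a first-overtone (RRc) period to its fundamental equivalent."""
    return 10.0 ** (math.log10(p_fo_days) + 0.127)


def model_magnitude(mu_i, m0_j, alpha_j, period_days):
    """Noise-free part of equation 6."""
    return mu_i + m0_j + alpha_j * math.log10(period_days / P0_DAYS)


def log_likelihood(m_obs, mu_i, m0_j, alpha_j, period_days,
                   sigma_intrinsic, sigma_m):
    """Gaussian log-density of one observed magnitude under equation 6."""
    variance = sigma_intrinsic ** 2 + sigma_m ** 2
    residual = m_obs - model_magnitude(mu_i, m0_j, alpha_j, period_days)
    return -0.5 * (math.log(2.0 * math.pi * variance) + residual ** 2 / variance)
```

Summing `log_likelihood` over all stars and wavebands gives the full data log-likelihood, since the error terms are stated to be independent.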
The error on the extinction-corrected mean-flux magnitudes, σ_mij, was derived by propagating the error from the contributing observed apparent magnitudes and colour excess terms (see equations 2, 3, and 4). The intrinsic scatter of the period–magnitude relations, σ_intrinsic,j, which is added in quadrature with σ²_mij to calculate the standard deviation of the likelihood, is adopted from the findings of Klein & Bloom (2014): σ_intrinsic,V = 0.0320, σ_intrinsic,I = 0.0713, and σ_intrinsic,z = 0.1153.
The prior distributions for M_0,j and α_j were normal distributions centred at the fitted values for the V, I, and z period–magnitude relations found by Klein & Bloom (2014), with standard deviations expanded to 0.2 for M_0 and 1.5 for α (to allow the MCMC traces freedom to explore a wider parameter-space). The same prior, N(18.5, 0.2163²), was used for all of the μ_i. This standard deviation was selected to be a fractional distance error of 10 per cent (≈ 5 kpc), which is much larger than the depth of the LMC and significantly larger than (> 2 times) the median posterior σ_μi.
Figure 3. V-, I-, and z-band period–magnitude relations (solid lines) derived for the LMC RR Lyrae population, superimposed on scatter plots of the RR Lyrae posteriors (M computed using μ_Post). The dashed lines denote the 1σ prediction intervals for a new RR Lyrae star with known period.
To fit the model given by equation 6 ten identical MCMC traces were run, each generating 3.5 million iterations. The first 0.5 million were discarded as burn-in and the remaining 3 million were thinned by 300 to result in ten traces of 10,000 iterations each. The Gelman-Rubin convergence diagnostic, R̂, (Gelman & Rubin 1992) was computed for each posterior model parameter (3 zero points, 3 slopes, and 15,040 distance moduli) and all are found to be well-converged (R̂ < 1.1).
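The convergence check can be illustrated with a short NumPy sketch. It uses the classic potential scale reduction factor built from within-chain and between-chain variances, without the degrees-of-freedom correction some packages apply; whether the authors' software computes exactly this variant is an assumption. The function name is hypothetical.

```python
import numpy as np


def gelman_rubin(chains):
    """Potential scale reduction factor for one parameter.

    `chains` has shape (n_chains, n_samples), after burn-in and thinning.
    """
    chains = np.asarray(chains, dtype=float)
    n_samples = chains.shape[1]
    within = chains.var(axis=1, ddof=1).mean()              # W
    between = n_samples * chains.mean(axis=1).var(ddof=1)   # B
    pooled = (n_samples - 1) / n_samples * within + between / n_samples
    return float(np.sqrt(pooled / within))


# Bookkeeping from the text: 3.5M iterations, 0.5M burn-in, thinned by 300.
samples_per_trace = (3_500_000 - 500_000) // 300
```

Values close to 1 indicate that the traces agree with one another; traces stuck in different regions inflate the between-chain term and push the statistic well above the 1.1 threshold.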
The best fitted period–magnitude relations and a scatter plot of the RR Lyrae posteriors (M computed using μ_Post) are presented in Fig. 3. The equations for the period–magnitude relations are
$$ M_V = (0.448 \pm 0.003) - (0.999 \pm 0.038) \times \log_{10}(P/P_0) \qquad (7) $$
$$ M_I = (0.073 \pm 0.002) - (1.701 \pm 0.034) \times \log_{10}(P/P_0) \qquad (8) $$
$$ M_z = (0.483 \pm 0.002) - (1.774 \pm 0.034) \times \log_{10}(P/P_0). \qquad (9) $$
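As a rough illustration of how equations 7 to 9 would be applied, the sketch below evaluates the central values of the relations and converts a distance modulus into a distance with the standard definition μ = 5 log10(d / 10 pc). The names are hypothetical and the quoted uncertainties are ignored.

```python
import math

P0_DAYS = 0.52854
# (zero point, slope) read off equations 7-9; the slope enters with a minus sign.
RELATIONS = {"V": (0.448, 0.999), "I": (0.073, 1.701), "z": (0.483, 1.774)}


def absolute_magnitude(band, period_days):
    """Central value of the period-magnitude relation for one band."""
    zero_point, slope = RELATIONS[band]
    return zero_point - slope * math.log10(period_days / P0_DAYS)


def distance_kpc(distance_modulus):
    """Invert mu = 5 log10(d / 10 pc) and return d in kiloparsecs."""
    return 10.0 ** (distance_modulus / 5.0 + 1.0) / 1000.0
```

With this definition the prior mean μ = 18.5 corresponds to about 50 kpc, so a 10 per cent fractional distance error is about 5 kpc, in line with the figure quoted for the prior.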
These results are consistent (within 2σ) with the findings published in Klein & Bloom (2014). The new slopes are conspicuously systematically lower, although the previous constraints are considerably wider. The extremely tight distributions for the posterior M_0 and α are due to the very large number of RR Lyrae stars in the calibration dataset, as compared to previous studies that have used calibration samples of a few dozen to slightly more than one hundred stars collated from the local Milky Way field RR Lyrae population.
now with 15,040 stars...
scipy.sparse, hdf5
Klein, JSB+14
Dust Map basemap
4× improvement in distance error
3D density projection
mayavi2
Astronomical Data Deluge
Serious Challenge to Traditional Approaches & Toolkits
Large Synoptic Survey Telescope (LSST) - 2020
    Light curves for 800M sources every 3 days
    10^6 supernovae/yr, 10^5 eclipsing binaries
    3.2 gigapixel camera, 20 TB/night
LOFAR & SKA
    150 Gps (27 Tflops) → 20 Pps (~100 Pflops)
Gaia space astrometry mission - 2014
    1 billion stars observed ∼70 times over 5 years
    Will observe 20K supernovae
Many other astronomical surveys are already producing data:
SDSS, iPTF, CRTS, Pan-STARRS, Hipparcos, OGLE, ASAS, Kepler, LINEAR, DES, etc.
Towards a Fully Automated Scientific Stack for Transients
strategy, scheduling, observing, reduction, finding, discovery, classification, followup, inference
current state of the art stack
automated / not (yet) automated
published work
NSF/CDI, NSF/BIGDATA
Our ML framework found the Nearest Supernova in 3 Decades...
‣ Built & Deployed robust, real-time machine learning framework, discovering >10,000 events in >10 TB of imaging → 50+ journal articles
‣ Built Probabilistic Event classification catalogs with innovative active learning
http://timedomain.org https://www.nsf.gov/news/news_summ.jsp?cntn_id=122537
What is the toolbox of the modern (data-driven) scientist?
domain training
statistics
advanced computing
database
GUI
parallel
visualization
Bayesian
machine learning
Physics
laboratory techniques
MCMC
MapReduce
And...How do we teach this with what little time the students have?
What is the toolbox of the modern (data-driven) scientist?
Data-Centric Coursework, Bootcamps, Seminars, & Lecture Series
BDAS: Berkeley Data Analytics Stack [Spark, Shark, ...]
parallel programming bootcamp
...and entire degree programs
2010: 85 campers 2012a: 135 campers
Python Bootcamps at Berkeley
a modern superglue computing language for (data) science
‣ high-level scripting language
‣ open source, huge & growing community in academia & industry
‣ Just in time compilation but also fast numerical computation
‣ Extensive interfaces to 3rd party frameworks
A reasonable lingua franca for scientists...
2012b: 210 campers
Python Bootcamps at Berkeley
2013a: 253 campers
‣ 3 days of live/archive streamed lectures
‣ all open material in GitHub
‣ widely disseminated (e.g., @ NASA)
http://pythonbootcamp.info
Part of the Designated Emphasis in Computational Science & Engineering at Berkeley
visualization
machine learning
database interaction
user interface & web frameworks
timeseries & numerical computing
interfacing to other languages
Bayesian inference & MCMC
hardware control
parallelism
“Are we alone in the universe? What makes up the missing mass of the universe? ... And maybe the biggest question of all: How in the wide world can you add $3 billion in market capitalization simply by adding .com to the end of a name?”
President William Jefferson Clinton Science and Technology Policy Address
21 January 2000
“Add Data Science or Big Data to your course name to increase enrollment by tenfold.”
Joshua Bloom, Just Now
Python for Data Science @ Berkeley [Sept 2013]
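[Pie chart, gender: 36% female, 64% male]
[Pie chart, department: Psychology, Astronomy, Neuroscience, Biostatistics, Physics, Chemical Engineering, ISchool, Earth and Planetary Sciences, Industrial Engineering, Mechanical Engineering; slices of 8%, 4%, 8%, 12%, 4%, 12%, 8%, 16%, 16%, and 12%, with the slice-to-department pairing not recoverable from the transcript]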
“Parallel Image Reconstruction from Radio Interferometry Data”
“Graph Theory Analysis of Growing Graphs”
http://mb3152.github.io/Graph-Growth/
“Realtime Prediction of Activity Behavior from Smartphone”
“Bus Arrival Time Prediction in Spain”
Time domain preprocessing
- Start with raw photometry
- Gaussian process detrending
- Calibration
- Petigura & Marcy 2012
Transit search
- Matched filter
- Similar to BLS algorithm (Kovács+ 2002)
- Leverages Fast-Folding Algorithm O(N^2) → O(N log N) (Staelin+ 1968)
Data validation
- Significant peaks in periodogram, but inconsistent with exoplanet transit
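The transit-search step can be pictured with a deliberately naive sketch: a brute-force grid over trial period and dip start time that scores a box-shaped dip by its signal-to-noise ratio. This is neither TERRA nor the Fast-Folding Algorithm, only a hedged illustration of what a box matched filter measures; the function name, the single fixed duration, and the SNR definition are all assumptions.

```python
import numpy as np


def box_search(time, flux, periods, duration, n_phase=50):
    """Return (snr, period, t0) of the strongest periodic box-shaped dip.

    t0 is the start of the dip within the period; all inputs share one time unit.
    """
    time = np.asarray(time, dtype=float)
    flux = np.asarray(flux, dtype=float)
    noise = np.std(flux)
    best = (0.0, None, None)
    for period in periods:
        for t0 in np.linspace(0.0, period, n_phase, endpoint=False):
            # Points whose folded phase falls inside the trial box.
            in_transit = ((time - t0) % period) < duration
            n_in = int(in_transit.sum())
            if n_in == 0 or n_in == flux.size:
                continue
            depth = flux[~in_transit].mean() - flux[in_transit].mean()
            snr = depth / (noise / np.sqrt(n_in))
            if snr > best[0]:
                best = (float(snr), float(period), float(t0))
    return best
```

The cost of this loop grows with the product of the grid sizes and the number of samples, which is the kind of scaling that folding-based algorithms are designed to reduce.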
TERRA – optimized for small planets
[Figure: light-curve panels labelled Raw Flux (ppt) and Calibrated Flux; TERRA produces the detrended/calibrated photometry]
Erik Petigura, Berkeley Astro Grad Student
Petigura, Howard, & Marcy (2013)
Prevalence of Earth-size planets orbiting Sun-like stars
Erik A. Petigura (a,b,1), Andrew W. Howard (b), and Geoffrey W. Marcy (a)
(a) Astronomy Department, University of California, Berkeley, CA 94720; and (b) Institute for Astronomy, University of Hawaii at Manoa, Honolulu, HI 96822
Contributed by Geoffrey W. Marcy, October 22, 2013 (sent for review October 18, 2013)
Determining whether Earth-like planets are common or rare looms as a touchstone in the question of life in the universe. We searched for Earth-size planets that cross in front of their host stars by examining the brightness measurements of 42,000 stars from National Aeronautics and Space Administration’s Kepler mission. We found 603 planets, including 10 that are Earth size (1−2 R⊕) and receive comparable levels of stellar energy to that of Earth (0.25−4 F⊕). We account for Kepler’s imperfect detectability of such planets by injecting synthetic planet–caused dimmings into the Kepler brightness measurements and recording the fraction detected. We find that 11 ± 4% of Sun-like stars harbor an Earth-size planet receiving between one and four times the stellar intensity as Earth. We also find that the occurrence of Earth-size planets is constant with increasing orbital period (P), within equal intervals of log P up to ∼200 d. Extrapolating, one finds 5.7 (+1.7/−2.2)% of Sun-like stars harbor an Earth-size planet with orbital periods of 200–400 d.
extrasolar planets | astrobiology
The National Aeronautics and Space Administration’s (NASA’s) Kepler mission was launched in 2009 to search for planets that transit (cross in front of) their host stars (1–4). The resulting dimming of the host stars is detectable by measuring their brightness, and Kepler monitored the brightness of 150,000 stars every 30 min for 4 y. To date, this exoplanet survey has detected more than 3,000 planet candidates (4).
The most easily detectable planets in the Kepler survey are those that are relatively large and orbit close to their host stars, especially those stars having lower intrinsic brightness fluctuations (noise). These large, close-in worlds dominate the list of known exoplanets. However, the Kepler brightness measurements can be analyzed and debiased to reveal the diversity of planets, including smaller ones, in our Milky Way Galaxy (5–7). These previous studies showed that small planets approaching Earth size are the most common, but only for planets orbiting close to their host stars. Here, we extend the planet survey to Kepler’s most important domain: Earth-size planets orbiting far enough from Sun-like stars to receive a similar intensity of light energy as Earth.
Planet Survey
We performed an independent search of Kepler photometry for transiting planets with the goal of measuring the underlying occurrence distribution of planets as a function of orbital period, P, and planet radius, R_P. We restricted our survey to a set of Sun-like stars (GK type) that are the most amenable to the detection of Earth-size planets. We define GK-type stars as those with surface temperatures T_eff = 4,100–6,100 K and gravities log g = 4.0–4.9 (log g is the base 10 logarithm of a star’s surface gravity measured in cm s⁻²) (8). Our search for planets was further restricted to the brightest Sun-like stars observed by Kepler (Kp = 10–15 mag). These 42,557 stars (Best42k) have the lowest photometric noise, making them amenable to the detection of Earth-size planets. When a planet crosses in front of its star, it causes a fractional dimming that is proportional to the fraction of the stellar disk blocked, δF = (R_P/R_⋆)², where R_⋆ is the radius of the star. As viewed by a distant observer, the Earth dims the Sun by ∼100 parts per million (ppm) lasting 12 h every 365 d.
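The transit-depth relation is easy to check numerically. The sketch below plugs in commonly quoted radii for the Earth and the Sun; those constants and the function name are supplied here for illustration and do not come from the paper.

```python
R_EARTH_KM = 6371.0     # mean radius of the Earth (assumed value)
R_SUN_KM = 695_700.0    # nominal solar radius (assumed value)


def transit_depth_ppm(r_planet_km, r_star_km):
    """Fractional dimming (R_P / R_star)^2 expressed in parts per million."""
    return (r_planet_km / r_star_km) ** 2 * 1e6


earth_depth = transit_depth_ppm(R_EARTH_KM, R_SUN_KM)
```

With these radii the depth comes out at a little under 100 ppm, the same order as the figure quoted in the text, and because the depth scales with the square of the planet radius, a planet twice the size of Earth produces a dip four times deeper.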
We searched for transiting planets in Kepler brightness measurements using our custom-built TERRA software package described in previous works (6, 9) and in SI Appendix. In brief, TERRA conditions Kepler photometry in the time domain, removing outliers, long timescale variability (>10 d), and systematic errors common to a large number of stars. TERRA then searches for transit signals by evaluating the signal-to-noise ratio (SNR) of prospective transits over a finely spaced 3D grid of orbital period, P, time of transit, t0, and transit duration, ΔT. This grid-based search extends over the orbital period range of 0.5–400 d.
TERRA produced a list of “threshold crossing events” (TCEs) that meet the key criterion of a photometric dimming SNR ratio SNR > 12. Unfortunately, an unwieldy 16,227 TCEs met this criterion, many of which are inconsistent with the periodic dimming profile from a true transiting planet. Further vetting was performed by automatically assessing which light curves were consistent with theoretical models of transiting planets (10). We also visually inspected each TCE light curve, retaining only those exhibiting a consistent, periodic, box-shaped dimming, and rejecting those caused by single epoch outliers, correlated noise, and other data anomalies. The vetting process was applied homogeneously to all TCEs and is described in further detail in SI Appendix.
To assess our vetting accuracy, we evaluated the 235 Kepler objects of interest (KOIs) among Best42k stars having P > 50 d, which had been found by the Kepler Project and identified as planet candidates in the official Exoplanet Archive (exoplanetarchive.ipac.caltech.edu; accessed 19 September 2013). Among them, we found four whose light curves are not consistent with being planets. These four KOIs (364.01, 2,224.02, 2,311.01, and 2,474.01) have long periods and small radii (SI Appendix). This exercise suggests that our vetting process is robust and that careful scrutiny of the light curves of small planets in long period orbits is useful to identify false positives.
Significance
A major question is whether planets suitable for biochemistry are common or rare in the universe. Small rocky planets with liquid water enjoy key ingredients for biology. We used the National Aeronautics and Space Administration Kepler telescope to survey 42,000 Sun-like stars for periodic dimmings that occur when a planet crosses in front of its host star. We found 603 planets, 10 of which are Earth size and orbit in the habitable zone, where conditions permit surface liquid water. We measured the detectability of these planets by injecting synthetic planet-caused dimmings into Kepler brightness measurements. We find that 22% of Sun-like stars harbor Earth-size planets orbiting in their habitable zones. The nearest such planet may be within 12 light-years.
Author contributions: E.A.P., A.W.H., and G.W.M. designed research, performed research,analyzed data, and wrote the paper.
The authors declare no conflict of interest.
Freely available online through the PNAS open access option.
Data deposition: The Kepler photometry is available at the Mikulski Archive for Space Telescopes (archive.stsci.edu). All spectra are available to the public on the Community Follow-up Program website (cfop.ipac.caltech.edu).
¹To whom correspondence should be addressed. E-mail: [email protected].
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1319909110/-/DCSupplemental.
www.pnas.org/cgi/doi/10.1073/pnas.1319909110 PNAS | November 26, 2013 | vol. 110 | no. 48 | 19273–19278
ASTRONOMY
Bootcamp/Seminar Alum
Python
DOE/NERSC computation
PNAS [2014]
Dangers of a Cursory Training/Teaching Curriculum
first this... ...then this.
Undergraduate & Graduate Training Mission
Thinking Data Literacy before Thinking Big Data Proficiency
Fitting a straight line to data
[Figure 10: two scatter plots of y against x, with x running from 0 to 300 and y from 0 to 700; one panel is annotated with the line y = 1.33 x + 164, the other with per-point values of 0.0 or 1.0.]
Figure 10: Partial solution to Exercise 14: On the left, the same as Figure 9 but including the outlier points. On the right, the same as in Figure 4 but applying the outlier (mixture) model to the case of two-dimensional uncertainties.
[Figure 11: scatter plot of y against x, with x running from 0 to 300 and y from 0 to 700. Forward fit: y = (2.24 ± 0.11) x + (34 ± 18); reverse fit: y = (2.64 ± 0.12) x + (−50 ± 21).]
Figure 11: Partial solution to Exercise 15: Results of “forward and reverse” fitting. Don’t ever do this.
Data analysis recipes: Fitting a model to data*
David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University
Max-Planck-Institut für Astronomie, Heidelberg
Jo Bovy
Center for Cosmology and Particle Physics, Department of Physics, New York University
Dustin Lang
Department of Computer Science, University of Toronto
Princeton University Observatory
Abstract
We go through the many considerations involved in fitting a model to data, using as an example the fit of a straight line to a set of points in a two-dimensional plane. Standard weighted least-squares fitting is only appropriate when there is a dimension along which the data points have negligible uncertainties, and another along which all the uncertainties can be described by Gaussians of known variance; these conditions are rarely met in practice. We consider cases of general, heterogeneous, and arbitrarily covariant two-dimensional uncertainties, and situations in which there are bad data (large outliers), unknown uncertainties, and unknown but expected intrinsic scatter in the linear relationship being fit. Above all we emphasize the importance of having a “generative model” for the data, even an approximate one. Once there is a generative model, the subsequent fitting is non-arbitrary because the model permits direct computation of the likelihood of the parameters or the posterior probability distribution. Construction of a posterior probability distribution is indispensible if there are “nuisance parameters” to marginalize away.
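For reference, the "standard weighted least-squares" case that the abstract treats as the narrow special case looks like this in NumPy. It is a sketch under exactly the assumptions the abstract lists (negligible x uncertainties, known Gaussian y uncertainties) and is not code from the paper; the function name is hypothetical.

```python
import numpy as np


def weighted_line_fit(x, y, sigma_y):
    """Weighted least-squares fit of y = m x + b with known Gaussian sigma_y."""
    x, y, sigma_y = (np.asarray(v, dtype=float) for v in (x, y, sigma_y))
    design = np.vstack([np.ones_like(x), x]).T       # columns: 1, x
    weights = 1.0 / sigma_y ** 2                      # diagonal of C^-1
    lhs = design.T @ (design * weights[:, None])      # A^T C^-1 A
    rhs = design.T @ (y * weights)                    # A^T C^-1 y
    b, m = np.linalg.solve(lhs, rhs)
    covariance = np.linalg.inv(lhs)                   # covariance of (b, m)
    return m, b, covariance
```

Outliers, uncertainties in x, or intrinsic scatter all break the assumptions behind this estimator, which is the point the abstract goes on to make about needing a generative model.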
It is conventional to begin any scientific document with an introduction that explains why the subject matter is important. Let us break with tradition and observe that in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously wrong and unnecessary. It is wrong because circumstances in which a set of two
*The notes begin on page 39, including the license¹ and the acknowledgements².
arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010
Statistical Inference
Undergraduate & Graduate Training Mission
Thinking Data Literacy before Thinking Big Data Proficiency
Versioning & Reproducibility
“Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible.” McNutt, 2014
http://www.sciencemag.org/content/343/6168/229.summary
- Git has emerged as the de facto versioning tool
- Berkeley Common Environment (BCE) Software Stack
- “Reproducible and Collaborative Statistical Data Science” (Statistics 157: P. Stark)
Undergraduate & Graduate Training Mission
Thinking Data Literacy before Thinking Big Data Proficiency
The IPython Notebook – Supporting Research at Berkeley
• Designing nuclear reactor cores
• Simulating electron flow in plasmas
• Computing supernovae spectra
• Analyzing brain activity
• Modeling neural networks
• Calculating quantum dynamics and spectroscopy
• Visualizing MRI results
Berkeley Data Analytics Stack A Comprehensive Big Data Reference Architecture
AMP Alpha or Soon
AMP Released
BSD/Apache
3rd Party Open Source
Apache Mesos | YARN Resource Manager | Resource
HDFS / Hadoop Storage
Tachyon | Storage
Apache Spark
Spark Streaming | ML-lib | Processing and Data Management
Applications: Traffic, Carat, Genomics, 3rd Party
Tools: Visualization, Data Cleaning, …
Shark (SQL)
BlinkDB | GraphX | MLBase | Analytics Frameworks | Spark-R
Methodology innovation: Inventing What’s Next in Big Data Analytics
Testing the Vision/Early Adopters/Momentum 2010+
(Recent) Data Science Industry Spinoffs at Berkeley
http://berkeleystartupcluster.com/
Data Science growing organically everywhere
Feb 15, 2013
AMP Lab
Ion Stoica, CS
Michael Franklin, CS
Adam Arkin, Bioengineering
Emmanuel Saez, Economics
Reconstructing the movies in your mind
Bin Yu, Statistics
Jack Gallant, Neuroscience
Earthquake: Strong Shaking in 11 seconds
Richard Allen, Earth & Plan. Science
Geospatial Lab
Fernando Perez, Brain Imaging Center
IPython tools and community
Charles Marshall
Rosie Gillespie
Integrative Biology
Digitized Museum
Created by Natalia Bilenko. Data source: PubMed Central. http://sciencereview.berkeley.edu/bsr_design/Issue26/datascience/cover/
Established CS/Stats/Math in Service of novelty in domain science
vs.
Novelty in domain science driving & informing novelty in CS/Stats/Math
“novelty² problem”: an extra Burden for Forefront Scientists
https://medium.com/tech-talk/dd88857f662
Berkeley Institute for Data Science
http://bitly.com/bundles/fperezorg/1
“Bold new partnership launches to harness potential of data scientists and big data”
Founded in December 2013 as a result of a year+ long national selection process
$37.8M over 5 years, along with University of Washington & NYU
‣ An accelerator for data-driven discovery
‣ An agent of change in the modern university as Data Science takes hold
‣ An incubator for the next generation of Data Science technology and practice
Leadership from across the spectrum
Joshua Bloom, Professor, Astronomy; Director, Center for Time Domain Informatics
Henry Brady, Dean, Goldman School of Public Policy
Cathryn Carson, Associate Dean, Social Sciences; Acting Director of Social Sciences Data Laboratory “D-Lab”
David Culler, Chair, EECS
Michael Franklin, Professor, EECS; Co-Director, AMP Lab
Erik Mitchell, Associate University Librarian
Faculty Lead/PI: Saul Perlmutter, Physics, Berkeley Center for Cosmological Physics
Fernando Perez, Researcher, Henry H. Wheeler Jr. Brain Imaging Center
Jasjeet Sekhon, Professor, Political Science and Statistics; Center for Causal Inference and Program Evaluation
Jamie Sethian, Professor, Mathematics
Kimmen Sjölander, Professor, Bioengineering, Plant and Microbial Biology
Philip Stark, Chair, Statistics
Ion Stoica, Professor, EECS; Co-Director, AMP Lab
BIDS goals
‣ Support meaningful and sustained interactions and collaborations between Methodology fields & Science domains to recognize what it takes to move these fields forward
‣ Establish new Data Science career paths that are long-term and sustainable
• A generation of multi-disciplinary scientists in data-intensive science
• A generation of data scientists focused on tool development
‣ Build an ecosystem of analytical tools, teaching, & research practices
• Sustainable, reusable, extensible, easy to learn and to translate across research domains
• Enables scientists to spend more time focusing on their science
A place to bring it all together at the Center
Vibrant nexus in the heart of campus
Doe Library
Enhancing strengths of:
• Simons Institute for the Theory of Computing
• AMP Lab
• CITRIS
• etc.
Doe Memorial Library @ the center of UC Berkeley
Berkeley Institute for Data Science Opening
Berkeley/UW/NYU Working Groups as Bridges
Applied Math
Towards an Inclusive Ecosystem
Expanding Participation Among Underrepresented Groups
[Pie chart, 2013 Python bootcamp: female, male, decline to state; slices of 11%, 56%, and 33%]
- 2013 AMP Camp: < 5% women
- Today @ PyData: 1 woman out of 18 speakers
- 2013 Python Seminar: 36% women
Chris Mentzel, Moore Foundation
@NYU, on Monday
Josh Greenberg, Sloan Foundation
Yann LeCun, NYU/Facebook
Summary
Data science at Berkeley is thriving and is getting an intellectual home
- incubating novel science & methodologies
- teaching & training
- innovate environments, interactions, & networks
A data scientist is a unicorn, but...
Looking for foundational industry partners to participate and help us grow
@profjsb
Thank you.
PyData, May 4, 2014