GalvanizeU Seattle: Eleven Almost-Truisms About Data
![Page 1: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/1.jpg)
Eleven Almost-Truisms About Data
2015-07-24 • Seattle
Paco Nathan, @pacoid O’Reilly Learning
![Page 2: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/2.jpg)
Set and Setting:
Almost a Dozen Almost-Truisms about Data … to consider when embarking on a journey into Data Science
There are a number of preconceptions about working with data at scale, where the realities beg to differ
We’ll crank this number up to eleven – even though the actual number is of course much larger, but that’s perhaps for another day
![Page 3: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/3.jpg)
Almost a Dozen Almost-Truisms about Data … to consider when embarking on a journey into Data Science
Let’s discuss some less-intuitive directions, along with likely consequences and corollaries
This is not intended to prove a set of points, rather to provide a set of launching points
Set and Setting:
![Page 4: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/4.jpg)
#01: Because Rates
![Page 5: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/5.jpg)
The rates of data being stored and analyzed jumped quite dramatically in the late 1990s to early 2000s … partly because storage became incredibly cheap … partly because internetworked machines suddenly started producing much more machine data
Fifteen years later, the rates jump again, this time by orders of magnitude … Because IoT
It’s almost like this thing has a pulse?
#01: Because Rates
![Page 6: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/6.jpg)
In other words, to paraphrase von Schelling, experience precedes analysis
Typically, we’re swimming in data, and we tend to respond by struggling to understand its structure and dynamics
That, in contrast to the myth that our analysis drives data collection
#01: Because Rates
![Page 7: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/7.jpg)
Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware
This effort prepared the way for huge Internet successes during the 1997 holiday season…
AMZN, EBAY, Inktomi (YHOO Search), then GOOG
MapReduce on clusters of commodity hardware and the Apache Hadoop open source stack emerged from this context
#01: Because Rates – 1997 Q3 Inflection Point
![Page 8: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/8.jpg)
Amazon: “Early Amazon: Splitting the website” – Greg Linden, glinden.blogspot.com/2006/02/early-amazon-splitting-website.html
eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett, addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html, addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf
Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff), youtu.be/E91oEn1bnXM
Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff), youtu.be/qsan-GQaeyk, perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx
#01: Because Rates – 1997 Q3 Inflection Point
![Page 9: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/9.jpg)
[Diagram labels: RDBMS; SQL Query result sets; recommenders + classifiers; Web Apps; customer transactions; Algorithmic Modeling; Logs; event history; aggregation; dashboards; Product; Engineering; UX; Stakeholder; Customers; DW; ETL; Middleware; servlets; models]
#01: Because Rates – Circa 2001, post e-commerce success
![Page 10: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/10.jpg)
[Diagram repeated from the previous slide, with the added label “data products”]
#01: Because Rates – Circa 2001, post e-commerce success
![Page 11: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/11.jpg)
Primary sources for the notion:
Cleveland, W. S., “Data Science: an Action Plan for Expanding the Technical Areas of the Field of Statistics,” International Statistical Review (2001), 69, 21-26. http://cm.bell-labs.com/stat/doc/datascience.ps
Breiman, L., “Statistical modeling: the two cultures,” Statistical Science (2001), 16:199-231. http://projecteuclid.org/euclid.ss/1009213726
…also good to mention John Tukey
#01: Because Rates – Whither Data Science?
![Page 12: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/12.jpg)
Rashomon, the 1950 Japanese period drama by Akira Kurosawa, symbolizes a long-standing tension in Statistics, one which Mark Twain described ever so succinctly…
wikipedia.org/wiki/Rashomon: “The film is known for a plot device which involves various characters providing alternative, self-serving and contradictory versions of the same incident.”
#01: Because Rates – A Sea Change
![Page 13: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/13.jpg)
Because IoT! (exabytes/day per sensor)
bits.blogs.nytimes.com/2013/06/19/g-e-makes-the-machine-and-then-uses-sensors-to-listen-to-it/
#01: Because Rates – A Sea Change, Redux
![Page 14: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/14.jpg)
#02: Batch Defenestration
![Page 15: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/15.jpg)
#02: Batch Defenestration
![Page 16: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/16.jpg)
#02: Batch Defenestration
Batch Analytics: Going strong, since 1944. Been there, done that.
![Page 17: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/17.jpg)
Businesses want to join the 21st c. and level up to streaming analytics
“I saw what you did … in batch,” now performed a zillion times faster
#02: Batch Defenestration – Infrastructure, Remodeled
[Chart: Contributors per Month to Spark, 2011 to 2015, y-axis 0 to 100]
Most active project at Apache; more than 500 known production deployments
![Page 18: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/18.jpg)
Tuning Spark Streaming for Throughput, Gerard Maas, 2014-12-22, virdata.com/tuning-spark/
#02: Batch Defenestration – “Team Apache”, $316.4M funding
![Page 19: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/19.jpg)
Can Spark Streaming survive Chaos Monkey? Bharat Venkat, Prasanna Padmanabhan, Antony Arokiasamy, Raju Uppalapati, techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html
#02: Batch Defenestration – Resiliency, at the edge of Comp Sci
![Page 20: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/20.jpg)
#03: Circa 1904
![Page 21: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/21.jpg)
Trending interests:
• electric cars
• organic farm-to-table cuisine
• permaculture
• sustainable urbanism
#03: Circa 1904
![Page 22: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/22.jpg)
Speaking of batch windows…
The last century or two of statistics represent an extremely huge mess
Let’s start the clock over, then move forward into a more real-time near-future
#03: Circa 1904
![Page 23: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/23.jpg)
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
Probability got going, formally, in the 16th c. – although interesting mathematical estimations trace back to classical times
Arabs in the 9th c. used frequency analysis – later rediscovered by Europeans during the early Italian Renaissance
Statistics followed, originally more about what we might call demographics – through 18th c.
![Page 24: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/24.jpg)
Laplace, Gauss, et al., bridged prob & stats in the late 18th c. using distributions (what we studied in Stats 101) to infer the probability of errors in estimates
Much of the 19th/20th c. work was about using goodness-of-fit tests, etc., to justify some distribution
• generally speaking, that requires samples
• that, in turn, implies batch windows
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
![Page 25: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/25.jpg)
While 19th/20th c. stats work focused on defensibility
21st c. work, w.r.t. Big Data apps, focuses more on predictability – plus there’s a shift in how we make estimates…
BTW, doesn’t it seem weird to crunch through piles of data in large batch jobs, at large expense, when the results get used to approximate features ultimately? Why not perform that in stream?
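As a hedged illustration of computing "in stream" rather than in batch, here is a minimal sketch of Welford's online algorithm, which keeps a running mean and variance while touching each observation only once. Welford's method is swapped in as a familiar example; it is not taken from the talk, and the names `RunningStats` and `summarize` are hypothetical.

```python
class RunningStats:
    """Running mean and variance via Welford's online algorithm (illustrative)."""

    def __init__(self):
        self.n = 0        # observations seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new observation into the summary in O(1) time and memory."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance (n - 1 denominator); 0.0 until two values are seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(stream):
    """Consume any iterable of numbers without storing it."""
    stats = RunningStats()
    for x in stream:
        stats.update(x)
    return stats
```

The summary is available after every `update`, so there is no batch window to wait for; the trade-off is that only statistics with an incremental update rule fit this pattern directly.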
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
![Page 26: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/26.jpg)
A fascinating, relatively new area pioneered by relatively few people – e.g., Philippe Flajolet
Provides approximation with error bounds using far fewer resources (RAM, CPU, etc.)
highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
![Page 27: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/27.jpg)
| algorithm | use case | example |
|---|---|---|
| Bloom Filter | set membership | code |
| MinHash | set similarity | code |
| HyperLogLog | set cardinality | code |
| Count-Min Sketch | frequency summaries | code |
| DSQ | streaming quantiles | code |
| SkipList | ordered sequence search | code |
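To make the first row of the table concrete, here is a minimal Bloom filter sketch for approximate set membership. It is an illustration only, not the code the slide linked to: the class name, the default sizes, and the choice of salted SHA-256 as the hash family are all assumptions.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive k bit positions by salting one hash function with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))
```

Usage is `bf = BloomFilter(); bf.add("spark"); "spark" in bf`. Memory stays fixed at `num_bits` no matter how many items are added; what grows instead is the false-positive rate.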
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
![Page 28: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/28.jpg)
E.g., ±4% could buy you two orders of magnitude reduction in the required memory footprint for an analytics app
OSS projects such as Algebird and BlinkDB provide for this newer approach to the math of approximations at scale
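The error-versus-memory trade can be sketched with a Count-Min Sketch, sized by the textbook rule width = ceil(e / epsilon) and depth = ceil(ln(1 / delta)). This is an illustration under stated assumptions, not Algebird's or BlinkDB's implementation: the `epsilon=0.04` default only echoes the ±4% figure (the talk does not say which structure or error definition it has in mind), and salted SHA-256 stands in for the pairwise-independent hash family that the formal guarantee assumes.

```python
import hashlib
import math


class CountMinSketch:
    """Frequency summaries in fixed memory; estimates never undercount."""

    def __init__(self, epsilon=0.04, delta=0.01):
        # Textbook sizing: overcount <= epsilon * total, with prob. >= 1 - delta.
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0

    def _index(self, row, item):
        # One salted hash per row picks the counter for this item.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )
```

With the defaults the sketch holds 68 × 5 = 340 counters regardless of how many distinct items flow through it, which is where the memory savings over an exact per-key count come from.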
#03: Circa 1904 – Lies, Damn Lies, Statistics, Data Science
![Page 29: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/29.jpg)
#04: Your API is an Illusion
![Page 30: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/30.jpg)
IMO, many notions of “API” are illusions
Arguably, reductionist shell games
And that imposes limitations on how we work, and even how we think…
#04: Your API is an Illusion
![Page 31: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/31.jpg)
[Diagram: workflow circa 2010, framed as representation, optimization, evaluation. Labels: ETL into cluster/cloud; data; visualize, reporting; Data Prep; Features; Learners, Parameters; Unsupervised Learning; Explore; train set; test set; models; Evaluate; Optimize; Scoring; production data; use cases; data pipelines; actionable results; decisions, feedback; “foo algorithms”; “bar developers”]
Algorithms and developer-centric template thinking only go so far in a workflow…
Results are shown in blue, while the real work is highlighted in red
#04: Your API is an Illusion – The Libraries: Alexandria, Redux
![Page 32: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/32.jpg)
On the other hand, Physics does well to teach modeling –
I like to hire physicists to work on Data teams…
They tend to get the interdisciplinary aspects: got the math background, coding experience, generally good at systems engineering, etc.
Not saying we must all rush out to get Physics degrees – there’s something to be learned there, vital for the work and priorities ahead
#04: Your API is an Illusion – The Interzone
![Page 33: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/33.jpg)
“The impact of computing extends far beyond science… affecting all aspects of our lives. To flourish in today's world, everyone needs computational thinking.” – Jeannette Wing, CMU
Computing now ranks alongside the proverbial Reading, Writing, and Arithmetic…
Center for Computational Thinking @ CMU http://www.cs.cmu.edu/~CompThink/
Exploring Computational Thinking @ Google https://www.google.com/edu/computational-thinking/
#04: Your API is an Illusion – Antidote: Computational Thinking
![Page 34: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/34.jpg)
#05: Code Inceptionism
![Page 35: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/35.jpg)
Even so, do we really need to write code for WordCount 10^N times?
#05: Code Inceptionism
![Page 36: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/36.jpg)
Inceptionism: Going Deeper into Neural Networks, Alexander Mordvintsev, Christopher Olah, Mike Tyka, Google (2015-06-17), googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html
Artificial Neural Networks have spurred remarkable recent progress in image classification and speech recognition. But even though these are very useful tools based on well-known mathematical methods, we actually understand surprisingly little of why certain models work and others don’t. So let’s take a look at some simple techniques for peeking inside these networks.
#05: Code Inceptionism
![Page 37: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/37.jpg)
Imagine data mining GitHub commit histories of popular open source projects, then applying genetic programming to evolve patches for other OSS projects... In other words, brilliant:
Sidebar: Claire Le Goues, automating software repair
Claire Le Goues, cmu.edu
GenProg: A Generic Method for Automatic Software Repair Claire Le Goues, ThanhVu Nguyen, Stephanie Forrest, Westley Weimer IEEE TSE (2012) www.cs.cmu.edu/~clegoues/docs/legoues-tse-genprog12.pdf
We describe the algorithm and report experimental results of its success on 16 programs totaling 1.25M lines of C code and 120K lines of module code, spanning eight classes of defects, in 357 seconds, on average. We analyze the generated repairs qualitatively and quantitatively to demonstrate that the process efficiently produces evolved programs that repair the defect, are not fragile input memorizations, and do not lead to serious degradation in functionality.
#05: Code Inceptionism
![Page 38: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/38.jpg)
#06: Database Extinction?
![Page 39: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/39.jpg)
Are databases going extinct?
Distributed file systems that can be accessed as column stores are generally quite useful
There’s an old saying in Computer Science: it’s difficult to distinguish a really good file system from a database, and vice versa
#06: Database Extinction?
![Page 40: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/40.jpg)
Original definitions for what became relational databases had less to do with dedicated SQL products and more in common with something like Spark SQL:
A relational model of data for large shared data banks, Edgar Codd, Communications of the ACM (1970), dl.acm.org/citation.cfm?id=362685
#06: Database Extinction?
![Page 41: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/41.jpg)
#06: Database Extinction?
[Diagram labels: Python, SQL, R, Streaming, Advanced Analytics; DataFrame; Tungsten Execution]
Tungsten physical execution: CPU-efficient data structures
Keep data closer to the CPU cache
![Page 42: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/42.jpg)
#07: “N Dims good, 2 Dims baa-d”
![Page 43: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/43.jpg)
Consider: matrices, pivot tables, etc.
Our thinking about data representation is often quite two-dimensional…
#07: “N Dims good, 2 Dims baa-d”
![Page 44: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/44.jpg)
• many real-world problems can be represented as graphs
• graphs can generally be converted into sparse matrices (bridge to linear algebra)
• eigenvectors find the stable points in a system defined by matrices – which may be more efficient to compute
• beyond simpler graphs, complex data may require work with tensors
#07: “N Dims good, 2 Dims baa-d”
![Page 45: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/45.jpg)
Suppose we have a graph as shown below:
We call x a vertex (sometimes called a node)
An edge (sometimes called an arc) is any line connecting two vertices
[Figure: graph with vertices u, v, w, x and edges u-v, u-x, v-w, v-x, w-x]
#07: “N Dims good, 2 Dims baa-d”
![Page 46: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/46.jpg)
We can represent this kind of graph as an adjacency matrix:
• label the rows and columns based on the vertices
• entries get a 1 if an edge connects the corresponding vertices, or 0 otherwise
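[Figure: the graph on vertices u, v, w, x]

|   | u | v | w | x |
|---|---|---|---|---|
| u | 0 | 1 | 0 | 1 |
| v | 1 | 0 | 1 | 1 |
| w | 0 | 1 | 0 | 1 |
| x | 1 | 1 | 1 | 0 |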
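The two labeling rules can be sketched in a few lines of Python. The edge list below is read off the matrix on the slide; the function name `adjacency_matrix` is a hypothetical helper, and a dense list-of-lists is used only for readability (a real graph at scale would call for a sparse representation).

```python
def adjacency_matrix(vertices, edges):
    """Build a dense 0/1 adjacency matrix for an undirected graph."""
    index = {v: i for i, v in enumerate(vertices)}  # vertex label -> row/column
    n = len(vertices)
    matrix = [[0] * n for _ in range(n)]
    for a, b in edges:
        i, j = index[a], index[b]
        matrix[i][j] = 1
        matrix[j][i] = 1  # undirected: mirror each edge across the diagonal
    return matrix


vertices = ["u", "v", "w", "x"]
# Edge list read off the matrix on the slide.
edges = [("u", "v"), ("u", "x"), ("v", "w"), ("v", "x"), ("w", "x")]
A = adjacency_matrix(vertices, edges)

for label, row in zip(vertices, A):
    print(label, row)
```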
#07: “N Dims good, 2 Dims baa-d”
![Page 47: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/47.jpg)
An adjacency matrix for an undirected graph always has certain properties:
• it is symmetric, i.e., A = Aᵀ
• it has real eigenvalues
Therefore algebraic graph theory bridges between linear algebra and graph theory
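As a small check of both properties on the four-vertex example, the sketch below tests symmetry and then uses power iteration to estimate the largest eigenvalue. Power iteration is an illustrative choice, not something the talk prescribes, and it relies on the matrix having a single eigenvalue of largest magnitude (true here, since the graph is connected and contains triangles). All function names are hypothetical.

```python
import math

# Adjacency matrix for the four-vertex example (rows/columns ordered u, v, w, x).
A = [
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 1, 0],
]


def is_symmetric(m):
    """True when m equals its own transpose."""
    n = len(m)
    return all(m[i][j] == m[j][i] for i in range(n) for j in range(n))


def mat_vec(m, v):
    """Matrix-vector product for a list-of-lists matrix."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dominant_eigenpair(m, iterations=200):
    """Power iteration: repeatedly apply m to a vector and renormalize."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = mat_vec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # For a unit vector v, the Rayleigh quotient v . (m v) estimates the eigenvalue.
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(m, v)))
    return eigenvalue, v


eigenvalue, eigenvector = dominant_eigenpair(A)
```

For this graph the dominant eigenvalue works out to (1 + √17) / 2, roughly 2.56, and the eigenvector gives equal weight to u and w and equal (larger) weight to v and x, mirroring their positions in the graph.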
#07: “N Dims good, 2 Dims baa-d”
![Page 48: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/48.jpg)
Tensors are a good way to handle time-series geo-spatially distributed linked data with lots of N-dimensional attributes
In other words, potentially a general case for handling much of the data that we’re likely to encounter
#07: “N Dims good, 2 Dims baa-d”
![Page 49: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/49.jpg)
Although tensor factorization is considered problematic, it may provide more general case solutions:
The Tensor Renaissance in Data Science Anima Anandkumar @UC Irvine radar.oreilly.com/2015/05/the-tensor-renaissance-in-data-science.html
Spacey Random Walks and Higher Order Markov Chains, David Gleich @Purdue, slideshare.net/dgleich/spacey-random-walks-and-higher-order-markov-chains
#07: “N Dims good, 2 Dims baa-d”
![Page 50: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/50.jpg)
#08: Science … and Data
![Page 51: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/51.jpg)
There is Science … and there is Data
Data Science is largely about interdisciplinary teams, largely about crossing boundaries (organizational, cognitive) that might otherwise preclude arriving at crucial insights –
In other words, about learning
It’s also about the repeatability and predictive aspects of science, where workflows combine people + automation
NB: may conflict with large portions of academia which tend to decontextualize subjects
#08: Science … and Data
![Page 52: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/52.jpg)
The Science in Data Science tends to rely on the phenomenology and modeling of complex systems (did we already mention Physics?)
Speaking of science and predictions, two important works to include:
• Charles Sanders Peirce – one of the most prolific scientists in the US, and also one of the most fierce critics (abduction, etc.)
• Karl Popper – who articulated some of the inherent risks of mixing “science”, “history”, and politics
#08: Science … and Data
![Page 53: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/53.jpg)
For excellent examples of Science and Data together, see CodeNeuro, particularly for use of notebooks:
#08: Science … and Data
![Page 54: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/54.jpg)
#09: Learning Curves are Forever
![Page 55: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/55.jpg)
Learning Curves are forever – the part you need to manage more carefully than just about anything else, especially within a social context
In some sense, this is the essence of Data Science: How well do you learn?
Much of the risk in managing a Data Science team is about budgeting for the learning curve
#09: Learning Curves are Forever
![Page 56: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/56.jpg)
In contrast, IT has a long history of practicing a flavor of engineering “conservatism”: highly structured process, strictly codified practices
People learn a few things well, then avoid having to struggle with learning many new things perpetually…
That leads to enormous teams and low ROI, among other badness
[Chart axes: scale ➞, complexity ➞]
#09: Learning Curves are Forever
![Page 57: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/57.jpg)
Throw Your Life a Curve, Whitney Johnson, blogs.hbr.org/johnson/2012/09/throw-your-life-a-curve.html
Aggressively Pro-Active Learning:
• deconstruction of the cognitive bias One Size Fits All
• “makes a compelling case for personal disruption”
• “plan your career around learning curves”
• hire people who learn/re-learn efficiently
#09: Learning Curves are Forever
![Page 58: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/58.jpg)
#09: Learning Curves are Forever
Education is more than just lessons, exams, certifications, instructor evaluations, etc., … though some tools would try to reduce it to that level
What’s even more interesting is to leverage ML to understand the “distance” between the learner and the subject material
![Page 59: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/59.jpg)
#10: Books, not so much, sadly…
![Page 60: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/60.jpg)
Speaking as a former alt bookstore owner…
Sadly, we don’t use books quite as much these days:
• above ~35: buy it on Kindle
• below ~35: watch it on YouTube
#10: Books, not so much, sadly…
![Page 61: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/61.jpg)
From a publisher perspective, consider some of the risks:
• fewer people buy the titles
• search engines surface oh-so-much noise
• increasingly, it’s more difficult for experts to take time to author good content and keep it updated
#10: Books, not so much, sadly…
[Chart: Contributors per Month to Spark, 2011 to 2015, y-axis 0 to 100]
Most active project at Apache; more than 500 known production deployments
![Page 62: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/62.jpg)
However, it’s unlikely that Kindle, etc., represent the end-all-be-all of publishing…
Here’s an idea: your next “book” or “video” should be able to compute something useful
#10: Books, not so much, sadly…
![Page 63: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/63.jpg)
Interactive notebooks: Sharing the code, Helen Shen, Nature (2014-11-05), nature.com/news/interactive-notebooks-sharing-the-code-1.16261
#10: Books, not so much – Repeatable Science
![Page 64: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/64.jpg)
Embracing Jupyter Notebooks at O'Reilly, Andrew Odewahn, 2015-05-07, https://beta.oreilly.com/ideas/jupyter-at-oreilly
“O'Reilly Media is using our Atlas platform to make Jupyter Notebooks a first class authoring environment for our publishing program.”
Jupyter, Thebe, Docker, etc.
#10: Books, not so much – Something Borrowed, Something New
![Page 66: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/66.jpg)
#11: A MOOCish Edumacation?
![Page 67: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/67.jpg)
MOOCs have become popular, some are quite useful … even so, these tend to have a very low completion rate
Don’t hold your breath waiting for MOOCs to replace other modes of education
Learning generally requires a social context: for reinforcement, peer insights/modeling, and frankly some people really feel a need to be given permission to learn
#11: A MOOCish Edumacation?
![Page 68: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/68.jpg)
One problem with university study is that disciplines tend to decontextualize
GalvanizeU is a rare opportunity in that way: accredited, with contextualized hands-on experience
#11: A MOOCish Edumacation?
![Page 69: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/69.jpg)
A significant improvement may be found in the notion of “flipped” or inverted classrooms
For a good example, see:
Caltech Offers Online Course with Live Lectures in Machine Learning, Yaser Abu-Mostafa (2012-03-30), http://www.caltech.edu/news/caltech-offers-online-course-live-lectures-machine-learning-4248
#11: A MOOCish Edumacation?
![Page 70: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/70.jpg)
So a good bit of advice about learning and Data Science … is to invert your classrooms, recontextualize, cross the boundaries to do things that matter, and leverage the hands-on social aspects of learning
Like here at GalvanizeU
Summary…
![Page 71: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/71.jpg)
Thank You
![Page 72: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/72.jpg)
contact:
Just Enough Math O’Reilly (2014)
justenoughmath.com, preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/
Intro to Apache Spark, O’Reilly (2015), shop.oreilly.com/product/0636920036807.do
![Page 73: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/73.jpg)
Sometimes A Strange Notion
![Page 74: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/74.jpg)
After we’ve cleaned up data, formulated workflows in terms of monoids, used graph representation, and parallelized with a wealth of linear algebra, much of the heavy-lifting that remains on the clusters is in optimization
For example, deep learning @Google uses many layers of neural nets trained with gradient descent optimization
Taming Latency Variability and Scaling Deep Learning, Jeff Dean @Google (2013), youtu.be/S9twUcX1Zp0
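For readers who have not met gradient descent, here is a minimal sketch on a toy convex function. It is plain batch gradient descent with a fixed step size, nowhere near the distributed training described in the talk; the objective, learning rate, and function names are all illustrative assumptions.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Plain gradient descent: step against the gradient a fixed number of times."""
    point = list(start)
    for _ in range(steps):
        g = grad(point)
        point = [p - learning_rate * gi for p, gi in zip(point, g)]
    return point


def grad_f(p):
    # Gradient of the toy objective f(x, y) = (x - 3)^2 + (y + 1)^2,
    # whose unique minimum is at (3, -1).
    x, y = p
    return [2.0 * (x - 3.0), 2.0 * (y + 1.0)]


minimum = gradient_descent(grad_f, start=[0.0, 0.0])
```

Each step costs one gradient evaluation, which is why the cost of computing gradients dominates once this loop is scaled up to large models and clusters.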
Vector Quantization:
![Page 75: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/75.jpg)
One claimed advantage of quantum algorithms is estimating gradients for large gradient descent problems with a constant number of queries… Once high-ROI apps are reworked to leverage lots of ML and large clusters, SGD represents the datacenter cost basis, notably the part that scales…
Want to slash costs exponentially? Plug in quantum for a game-changer, maybe
Fast quantum algorithm for numerical gradient estimation Stephen P. Jordan Phys. Rev. Lett. 95, 050501 (2005) arxiv.org/abs/quant-ph/0405146 dwavesys.com
Vector Quantization:
![Page 76: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/76.jpg)
Proposal: let’s drop clusters of quantum devices into lunar polar craters, so we can handle massive vector quantization workloads
• micro-kelvin environs
• near perpetual sunlight for energy sources
• park routers at L4
• approx. $15B to finance, i.e., ~6 days DoD budget
Vector Quantization:
![Page 77: GalvanizeU Seattle: Eleven Almost-Truisms About Data](https://reader033.fdocuments.in/reader033/viewer/2022042819/55cb7d70bb61eb181d8b4596/html5/thumbnails/77.jpg)
We’ll just put this here… a couple o’ Googly projects in progress:
qCraft: Quantum Physics In Minecraft plus.google.com/u/1/+QuantumAILab/posts/grMbaaDGChH
Vector Quantization:
“We’re going back to the Moon. For good.” lunar.xprize.org