Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup
![Page 1: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/1.jpg)
Data Science Consulting
or
Science meets business, again.
Third time a charm?
David Johnston
ThoughtWorks
March 17, 2014
![Page 2: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/2.jpg)
Postdocs drive the world’s economy –
Young scientists become…
Professors
![Page 3: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/3.jpg)
ThoughtWorks
• Global software consulting company
• HQ in Chicago. Major offices in New York, San Francisco, Dallas, India, Brazil,
Australia and China; over 30 offices worldwide.
• Privately owned by Roy Singham
• Flat hierarchy of passionate people
![Page 4: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/4.jpg)
Agile Analytics at TW
• Practice started in 2011
• Led by Ken Collier and John Spens
• About a dozen people involved
Key Theme of Ken’s book
• BI, data warehousing and analytics have largely
missed the revolution in agile methodologies. We
can do it differently.
What we do
• Probabilistic modeling
• Predictive analytics / machine learning
• Advanced BI, prescriptive analysis
• Big Data technologies
• Advanced algorithms and data structures, streaming
![Page 5: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/5.jpg)
Case Studies:
Recommendation system for a
retail customer. Our Bayesian
model (blue)
Healthcare group purchasing organization
• The problem: matching medical products by text description (fuzzy matching).
• Solution in place: a complicated rules engine with a 60% match rate that needed one day per run.
• In 3 weeks we delivered a lightweight solution in Python: >80% match rate, runtime of a few minutes (on a laptop).
• Later moved to Elasticsearch for even better results.
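
The slides do not show the matching code, so the following is only a minimal sketch of what lightweight fuzzy matching of product descriptions can look like in Python. It assumes token normalization plus the standard library's `difflib.SequenceMatcher`; the function names, the 0.8 threshold, and the sample catalog are all hypothetical and are not taken from the project.

```python
import re
from difflib import SequenceMatcher


def normalize(description: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two normalized descriptions."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(query: str, catalog: list[str], threshold: float = 0.8):
    """Return (catalog_entry, score) for the closest entry, or None if nothing clears the threshold."""
    if not catalog:
        return None
    score, entry = max((similarity(query, item), item) for item in catalog)
    return (entry, score) if score >= threshold else None


# Hypothetical catalog and query, for illustration only.
catalog = [
    "Latex Exam Gloves, Large, Box of 100",
    "Syringe 10 mL Luer Lock",
    "Gauze Pad Sterile 4x4",
]
query = "GLOVES EXAM LATEX LARGE box/100"

if __name__ == "__main__":
    print(best_match(query, catalog))
    print(best_match("Wheelchair folding steel", catalog))
```

Comparing every query against every catalog entry is quadratic, which is workable at laptop scale but is one reason a search engine with an inverted index becomes attractive for larger catalogs.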
![Page 6: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/6.jpg)
What exactly is data
science?
• Is this really new? - Not really
• Does the term “data science” make any sense? - Not really, but so what?
• Is it just a fad? - No. Over-hyped? - Sometimes.
• Why did this term become popular just a few years back? - Productivity
• Where is this going?
• Should scientists/engineers/math-types really go and make a career doing this? - Yes, for most
![Page 7: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/7.jpg)
Is it new?
Of course not
Combination of many subjects:
• Mathematics and statistics – probability theory
• Machine learning
• Computer science – algorithms, data structures, databases
• Operations research - process optimization
• Business consulting
• Software development
Where have we seen this before?
Business: Finance, Insurance, Sports, Government accounting, Retail
Science: Physics, Astronomy, Biology
![Page 8: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/8.jpg)
Why now? Data scientist productivity growth
crosses critical threshold for new job creation
• Salary increase over postdoc requires
~2.5x
• Salaries in industry are set by
productivity and supply/demand
• Crossing the threshold in productivity
leads to new job creation
• A slowdown in productivity growth
and/or changes in supply/demand
will eventually end this burst in job
creation
• Nothing magical happened in 2005!
![Page 9: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/9.jpg)
Productivity drivers for data science
Long time scale
• Compute, Moore’s law
• The internet (duh!)
• Hard-disk and RAM price drops
• Science learns to deal with
Big Data
• Growing importance of
statistics
More recent
• Git, code sharing
• Machine learning libraries
• Python/R, open source
• Hadoop and ecosystem
• The Cloud, AWS
• NoSQL databases, in-memory
• Growing community cohesion around “data science”; feedback effects of popularity
![Page 10: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/10.jpg)
So what is data science now?
My definition of data science:
An interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science.
Misses only the first one
![Page 11: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/11.jpg)
Are we there yet?
Overhyped, underhyped, mis-hyped?
• No, probably not
• Productivity growth is real
• We are solving important
problems. Plenty left.
• Big Data will probably
peak in the hype cycle
before data science
• Just watched my first
analytics commercial. IBM.
“Math is not a fad”
- Aaron Erickson, ThoughtWorks
![Page 12: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/12.jpg)
Case study: Particle Physics
Data reduction par excellence
• 600 million collisions per second
• Most are boring events and are not saved
• Save ~ 100 petabytes per year
Goal
Determine existence of the Higgs boson: 1 bit
Measure its mass to 1%: ~1 byte
Data = Exabytes
Information = 9 bits
Compression 10^18
$9 billion per byte!
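
The arithmetic behind these figures can be checked in a few lines of Python. This is a rough sketch, assuming that "Exabytes" means exactly 1 exabyte (10^18 bytes); the budget at the end is only back-calculated from the slide's "$9 billion per byte" and is not a sourced cost figure.

```python
import math

# Figures taken from the slide.
EXISTENCE_BITS = 1          # does the Higgs boson exist? (yes/no)
MASS_BITS = 8               # mass to ~1%, rounded up to one byte
COST_PER_BYTE_USD = 9e9     # "$9 billion per byte"

# Assumption: "Exabytes" is read as exactly 1 exabyte of raw data.
RAW_DATA_BYTES = 10**18

information_bits = EXISTENCE_BITS + MASS_BITS            # 9 bits
compression = RAW_DATA_BYTES * 8 / information_bits      # about 8.9e17, i.e. order 10^18

# Distinguishing 100 values (1% precision) needs log2(100) bits,
# about 6.6, so "about one byte" is a generous round-up.
bits_for_one_percent = math.log2(100)

# Back-calculation only: the total spend implied by the per-byte figure.
implied_budget_usd = COST_PER_BYTE_USD * information_bits / 8   # about $10.1 billion

if __name__ == "__main__":
    print(f"information: {information_bits} bits")
    print(f"compression: {compression:.2e}")
    print(f"bits for 1% precision: {bits_for_one_percent:.2f}")
    print(f"implied budget: ${implied_budget_usd:.3e}")
```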
![Page 13: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/13.jpg)
Data science consulting
The good
• Always something new, always learning.
• Exposed to many different people.
• Get to see how everything works on the inside.
• See the world!
• Low career risk but still fun.
The bad
• Your clients choose you
• People problems often
more important than math
problems
• Travel can be extreme
• Your great ideas will rarely
be credited to you.
![Page 14: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/14.jpg)
Challenges in data science
consulting
Common challenges
• Businesses don’t yet understand the terminology, process or techniques. Much teaching involved
• Visionary CEO sends you into a not-so-visionary environment
• Problems can be vague
• Communication with business stakeholders takes much of your time
• We are still developing an effective model. More than just agile techniques
Red flags
• “Build us a platform for analytics so we can become a data-driven company.” Non sequitur
• Wanting prediction of the unpredictable
• Attempting to use ML on noisy data
• When incentives and opinions are all over the map
• Convinced that the problem was solved 20 years ago. E.g. linear regression, segmentation model, SAS.
![Page 15: Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup](https://reader038.fdocuments.in/reader038/viewer/2022100604/5599389c1a28abfc168b47d7/html5/thumbnails/15.jpg)
Keep offering up bold
ideas
• Look for ways for major
productivity enhancement
• Keep up on cutting-edge
literature in stats/ML
• All my best ideas for web
apps are now successful
companies.
• Everybody laughed at
them!
Data science is NOT going to be
productized.
FIN