Data Science Consulting at ThoughtWorks -- NYC Open Data Meetup

Page 1

Data Science Consulting

or

Science meets business, again.

Third time's a charm?

David Johnston

ThoughtWorks

March 17, 2014

Page 2

Postdocs drive the world's economy –

Young scientists become…

Professors

Page 3

ThoughtWorks

• Global software consulting company

• HQ in Chicago. Major offices in NY, San Francisco, Dallas, India, Brazil, Australia, China; over 30 worldwide.

• Privately owned by Roy Singham

• Flat hierarchy of passionate people

Page 4

Agile Analytics at TW

• Practice started in 2011

• Led by Ken Collier and John Spens

• About a dozen people involved

Key theme of Ken’s book

• BI, data warehousing and analytics have largely missed the revolution in agile methodologies. We can do it differently.

What we do

• Probabilistic modeling

• Predictive analytics / machine learning

• Advanced BI, prescriptive analysis

• Big Data technologies

• Advanced algorithms and data structures, streaming

Page 5

Case Studies:

Recommendation systems for a retail customer. Our Bayesian model (blue)

Healthcare group purchasing organization

• Problem: matching medical products by text description (fuzzy matching).

• Existing solution: a rules engine. Complicated, 60% match rate, one day required per run.

• In 3 weeks we delivered a lightweight solution in Python: >80% match rate, runtime of a few minutes (on a laptop).

• Later moved to Elasticsearch for even better results.
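To give a feel for what lightweight fuzzy matching of product descriptions can look like, here is a minimal sketch using only the Python standard library. It is not the solution delivered to the client; the function names, the sample catalog, and the 0.6 similarity cutoff are all hypothetical choices made for illustration.

```python
import re
from difflib import SequenceMatcher


def normalize(description: str) -> str:
    """Lowercase, drop punctuation, and sort tokens so word order does not matter."""
    tokens = re.findall(r"[a-z0-9]+", description.lower())
    return " ".join(sorted(tokens))


def best_match(query: str, catalog: list[str], min_score: float = 0.6):
    """Return (entry, score) for the most similar catalog entry.

    The entry is None when the best score falls below min_score
    (a hypothetical cutoff) or when the catalog is empty.
    """
    if not catalog:
        return None, 0.0
    q = normalize(query)
    # Score every catalog entry by character-level similarity of the normalized text.
    scored = [(SequenceMatcher(None, q, normalize(c)).ratio(), c) for c in catalog]
    score, entry = max(scored)
    return (entry, score) if score >= min_score else (None, score)


# Hypothetical catalog entries, invented for this example.
catalog = [
    "Syringe 10 mL, sterile, Luer lock",
    "Nitrile exam gloves, large, box of 100",
]
entry, score = best_match("sterile syringe luer-lock 10ml", catalog)
print(entry, round(score, 2))
```

Scoring every pair like this is quadratic in the number of descriptions, which is fine for a laptop-sized job; a search index becomes attractive as the catalog grows.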

Page 6

What exactly is data science?

• Is this really new? Not really.

• Does the term “data science” make any sense? Not really, but so what?

• Is it just a fad? Over-hyped? No; sometimes.

• Why did this term become popular just a few years back? Productivity.

• Where is this going?

• Should scientists/engineers/math-types really go and make a career doing this? Yes, for most.

Page 7

Is it new?

Of course not

Combination of many subjects:

• Mathematics and statistics – probability theory

• Machine learning

• Computer science – algorithms, data structures, databases

• Operations research - process optimization

• Business consulting

• Software development

Where have we seen this before?

Business: Finance, Insurance, Sports, Government accounting, Retail, Google

Science: Physics, Astronomy, Biology

Page 8

Why now? Data scientist productivity growth crosses critical threshold for new job creation

• Salary increase over postdoc requires ~2.5x

• Salaries in industry are set by productivity and supply/demand

• Crossing the threshold in productivity leads to new job creation

• Eventual slowing in productivity and/or changes in supply/demand will end this burst in job creation

• Nothing magical happened in 2005!
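The threshold argument can be illustrated with a toy model: if productivity grows smoothly, the moment it crosses a fixed threshold still looks like a sudden event. The sketch below reads the ~2.5x figure as the multiple that productivity must reach; the starting level and the 10% annual growth rate are hypothetical numbers chosen only to show the shape of the argument, not data from the talk.

```python
import math


def years_to_cross(start_ratio: float, annual_growth: float, threshold: float = 2.5) -> int:
    """Whole years until start_ratio * (1 + annual_growth)**t first reaches threshold."""
    if start_ratio >= threshold:
        return 0
    if annual_growth <= 0:
        raise ValueError("productivity never reaches the threshold without growth")
    return math.ceil(math.log(threshold / start_ratio) / math.log(1 + annual_growth))


# Hypothetical inputs: productivity starts at 1.0x and grows 10% per year.
print(years_to_cross(start_ratio=1.0, annual_growth=0.10))
```

Nothing special happens in the crossing year itself; steady compounding simply passes the level at which new jobs become worth creating.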

Page 9

Productivity Drivers for Data Science

Long time scale

• Compute, Moore’s law

• The internet (duh!)

• HD and RAM price drop

• Science learns to deal with Big Data

• Growing importance of statistics

More recent

• Git, code sharing

• Machine learning libraries

• Python/R, open source

• Hadoop and ecosystem

• The Cloud, AWS

• NoSQL databases, in-memory

• Growing community in “data science”: cohesion, feedback effects of popularity

Page 10

So what is data science now?

My definition of data science:

An interdisciplinary field utilizing statistics, computer science and the methods of scientific research in areas outside of science.

Misses only the first one

Page 11

Are we there yet? Overhyped, underhyped, mis-hyped?

• No, probably not

• Productivity growth is real

• We are solving important problems. Plenty left.

• Big Data will probably peak in the hype cycle before data science

• Just watched my first analytics commercial. IBM.

“Math is not a fad” - Aaron Erickson, ThoughtWorks

Page 12

Case study: Particle Physics

Data reduction par excellence

• 600 million collisions per second

• Most are boring events and are not saved

• Save ~100 petabytes per year

Goal

Determine existence of Higgs boson: 1 bit

Measure its mass to 1%: ~1 byte

Data = Exabytes

Information = 9 bits

Compression 10^18

$9 billion per byte!
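The arithmetic behind these figures can be checked in a few lines. This sketch assumes "Exabytes" means exactly one decimal exabyte (10^18 bytes); the total cost it prints is only what the slide's own per-byte figure implies, not a sourced budget.

```python
import math

# Figures taken from the slide.
INFO_BITS = 1 + 8          # 1 bit (existence) + ~1 byte (mass to 1%)
COST_PER_BYTE_USD = 9e9    # "$9 billion per byte"

# Assumption: "Exabytes" is read as one decimal exabyte.
DATA_BYTES = 10**18


def compression_ratio(data_bytes: int, info_bits: int) -> float:
    """Ratio of raw data size to extracted information, both measured in bits."""
    return data_bytes * 8 / info_bits


ratio = compression_ratio(DATA_BYTES, INFO_BITS)
order_of_magnitude = round(math.log10(ratio))

# Total spend implied by the per-byte price and 9 bits of information.
implied_total_cost = COST_PER_BYTE_USD * INFO_BITS / 8

print(f"compression ~ 10^{order_of_magnitude}")
print(f"implied total cost ~ ${implied_total_cost:,.0f}")
```

Under that assumption the ratio is about 8.9 x 10^17, which rounds to the 10^18 quoted on the slide.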

Page 13

Data science consulting

The good

• Always something new, always learning.

• Exposed to many different people.

• Get to see how everything works on the inside.

• See the world!

• Low career risk but still fun.

The bad

• Your clients choose you

• People problems are often more important than math problems

• Travel can be extreme

• Your great ideas will rarely be credited to you.

Page 14

Challenges in data science consulting

Common challenges

• Businesses don’t yet understand the terminology, process or techniques. Much teaching involved

• Visionary CEO sends you into a not-so-visionary environment

• Problems can be vague

• Communication with business stakeholders takes much of your time

• We are still developing an effective model. More than just agile techniques

Red flags

• “Build us a platform for analytics so we can become a data-driven company.” Non sequitur

• Wanting prediction of the unpredictable

• Attempting to use ML on noisy data

• When incentives and opinions are all over the map

• Convinced that the problem was solved 20 years ago. E.g. linear regression, segmentation model, SAS.

Page 15

Keep offering up bold ideas

• Look for ways for major productivity enhancement

• Keep up on cutting-edge literature in stats/ML

• All my best ideas for web-apps are now successful companies.

• Everybody laughed at them!

Data science is NOT going to be productized.

FIN