Trends & Lessons from Silicon Valley: For Startups December 2013.
Disrupting with Data: Lessons from Silicon Valley
Uploaded by anand-rajaraman
50-fold Growth from 2010 to 2020
2014: More bits in the digital universe than stars in the physical universe
Sources of Data
• The world creates 1.7MB of data per minute per person
Source: The Digital Universe -- IDC Report, 2014
Talk outline
• The evolution of data-driven applications
  • 5 generations
• Lessons and Opportunities
  • From the intersection of startups, venture capital, and research
  • Key theme: Disruption vs Optimization
• Conclusion
Follow the Data!
• Value-creation has followed the most valuable data sources available!
• 5 overlapping generations
Data-Driven Apps: The First Generation
• All about leveraging private, structured data assets for competitive advantage
  • E.g., Sales, inventory, payroll, …
Data-Driven Apps: The Third Generation
• Leveraging the power of “semi-public” Social + Mobile Data
  • Personal data shared in a frictionless manner with user’s consent
4G Example: Paysa
• Am I being compensated fairly?
  • 2012 Stanford CS grad
  • Java, C++, Ruby, and Machine Learning
  • Software Eng II at Google
4G Example: Paysa
• Salaries: 35M+ salary datapoints
• Companies: 500k+ companies
• People: Professional DNA of 15M tech employees
• Jobs: Millions of job postings updated daily
Data sources (private and public):
• Local/National Government Databases
• Partnerships (e.g., Udacity)
• Recruiters
• Companies Web Crawl
• Social Media
The Fifth Generation: Just add AI!
• Companies generate massive amounts of training data
  • New class of proprietary data
Lessons and Opportunities
1. Startup and Investment Landscape
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. Rise of the Cyborg
5. The Data is not a Given
The “Why?” Question
• Why are signups down this week?
• Why did this marketing campaign do so well?
• Why did this A/B test not perform?
Trends and Takeaways
• Infrastructure is available and solid
• Major transition from Hadoop to Spark
• Investment focus on “Vertical” analytics plays
  • e.g., Cuberon, Ayasdi
• The Age of the Intelligent App has dawned
  • Major opportunities and investment dollars flowing here!
  • e.g., Troo.ly, Descartes Labs, DocsApp
Why does disruption happen?
• Data scientist as advisor not decision maker
• Domain expertise and experience often win out over data
• Data-driven approach enables a completely different business model
  • E.g., A la carte streaming vs fixed number of channels
  • Cannibalization concerns
• Fear of making mistakes
  • Algorithms can make mistakes
  • But algorithms can learn and improve much faster with data!
Why does disruption happen?
• Classic Innovator’s Dilemma with a turbo-boost: data network effects
  • Accelerates the pace of disruption
Disruption Example: Venture Capital
• Venture Capital has been an established industry for several decades
  • Process has not changed much since early days
  • VC firms expect entrepreneurs to approach them with pitches
• Some VC firms have tried using data
  • Data scientists in advisory role
  • Not partners who make investment decisions
• High concentration in Silicon Valley
  • And a few other places…
More Global Startups
Reduced costs to launch a startup
Large consolidating markets; smartphone ubiquity
Emerging Market Opportunities
Untapped talent pools
Beyond Human Scale
2.1 Million “Startups”
115K need funding at any time
90% outside Silicon Valley
12.8 Million Companies
Why Data-Driven? Geography
[Chart: Number of $B companies by year, 2003 to 2014. Y-axis: Count. Series: Silicon Valley, Outside Silicon Valley]
Business Model Innovation
• Proactively identify interesting companies and reach out to them at the appropriate moment
[Pie chart: share by region]
• South America: 9%
• East Europe: 11%
• China: 13%
• India: 7%
• Other East Asia: 11%
• Other Europe: 5%
• Other North America: 7%
• US SF: 11%
• US Other: 22%
• Unknown: 4%
Optimize or Disrupt?
• Key question for every entrepreneur (and researcher too!)
  • Often difference between success and failure
• Hard to answer in general, but look out for disruption cues
  • Established, fragmented industry
  • Slow to adopt latest technology trend
  • Asset-heavy models
• Risk/reward tradeoff
  • Disruption is much riskier but the rewards compensate
Peripheral Vision
• To make optimal decisions, humans must provide “peripheral vision” to model
• Is this data point an outlier or does it fit the model?
  • e.g., Geo or category in VC
• Is there bias in the model?
  • e.g., historical racial gap in sentencing and parole decisions
• Has the world changed in a way that invalidates the assumption of the model?
  • e.g., flash crash on Wall Street
The Problem
• Must judges, policemen, doctors, bureaucrats understand the nuances of the data and the model?
• Even trickier when we consider complex workflows involving multiple decision makers
  • e.g., a drug trial
The Opportunity
• Systems that include humans and models as peers
  • Can also be complex workflows that involve many humans and models
• How best to structure such systems to produce optimal decisions?
  • Model might need to be tuned to work with specific human
• Model Invalidation
  • Can models know when they are no longer valid?
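One way to make the model-invalidation question concrete is a drift check on the model's own error. The sketch below is only an illustration under stated assumptions: the function name, the z-score test, and the threshold are hypothetical choices, not something the talk prescribes. It flags a model when the mean error on a recent window sits far above the mean error observed at validation time.

```python
from statistics import mean, stdev


def model_looks_invalid(baseline_errors, recent_errors, z_threshold=3.0):
    """Return True when recent mean error is far above the baseline mean.

    baseline_errors: per-example errors measured when the model was validated
                     (needs at least two values).
    recent_errors:   per-example errors on a recent window of live data.
    z_threshold:     hypothetical cutoff, in standard errors of the mean.
    """
    base_mean = mean(baseline_errors)
    base_std = stdev(baseline_errors)
    # Standard error of a mean taken over a window as large as recent_errors.
    stderr = base_std / len(recent_errors) ** 0.5
    if stderr == 0:
        # Baseline had no spread at all: any increase counts as drift.
        return mean(recent_errors) > base_mean
    z = (mean(recent_errors) - base_mean) / stderr
    return z > z_threshold
```

A check like this only says that something changed; a human still has to supply the "peripheral vision" to decide whether the world moved or the data pipeline broke.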
Is it time to disrupt Mechanical Turk?
• The world has changed a lot since Mechanical Turk was introduced in 2005
• Can we move closer to true hybrid human-machine computing?
  • Harness both human initiative and computing power
  • Harness sensors in phones
• Reimagine problems, tasks and incentives
The Agency Problem
• Each model is optimized for the good of the company that owns it
• Often our goals and the company’s goals are in alignment but not always!
Problems
• Privacy
  • Everyone has your data and is modeling your actions
• Pricing and Discovery disadvantage
  • You discover only what they choose to show you
• You are not a population
  • Each service models its population of users
  • And is optimizing for its own ends
• Would you rather be explored or exploited?
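The closing question appears to play on the explore/exploit tradeoff that recommendation services face. As a minimal sketch, assuming an epsilon-greedy policy (a common textbook choice, not one named in the talk) and hypothetical item names, a service either shows you its current best guess or uses you to test something new:

```python
import random


def epsilon_greedy_pick(avg_reward, epsilon, rng):
    """Pick an item to show a user.

    avg_reward: dict mapping item name -> average reward observed so far.
    epsilon:    probability of exploring (showing a random item).
    rng:        a random.Random instance, passed in for reproducibility.
    Returns (item, mode) where mode is "explore" or "exploit".
    """
    if rng.random() < epsilon:
        # Explore: the user is used to gather data about a random item.
        return rng.choice(sorted(avg_reward)), "explore"
    # Exploit: the user is shown whatever has paid off best for the service.
    return max(avg_reward, key=avg_reward.get), "exploit"


rewards = {"item_a": 0.2, "item_b": 0.9, "item_c": 0.5}
print(epsilon_greedy_pick(rewards, 0.1, random.Random(0)))
```

Either way the reward being maximized is the service's, which is the agency problem in miniature.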
Cyborg Layer Services
• Privacy protection
  • e.g., using Differential Privacy techniques
  • Or by strategically spreading interactions across services
    • e.g., watch some movies on Netflix and some on Amazon
• Discovery and Pricing
  • Looks at a larger selection and picks items for you
  • Acts strictly as your agent; no conflict
• Combine personal and population models
  • Cyborg has complete access to all my data
  • External services have population data, but only limited window
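To illustrate the differential-privacy bullet, here is a minimal sketch of the Laplace mechanism applied to a counting query. The scenario (a personal agent releasing a noisy count of movies watched in a genre) and all names are assumptions made for the example; the talk does not specify a mechanism.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    A counting query changes by at most 1 when one record is added or
    removed (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical viewing history held by the user's own agent.
watched = ["drama", "comedy", "drama", "horror", "drama"]
noisy = private_count(watched, lambda g: g == "drama", epsilon=1.0,
                      rng=random.Random(7))
print(round(noisy, 2))
```

Smaller values of `epsilon` add more noise: stronger privacy, at the price of less useful answers for the external service.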
Lessons and Opportunities
1. The Age of the App
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. The Rise of the Cyborg
5. The Data is not a Given
How to build a Model: Conventional View
• Use ground truth to build the best model possible
  • Feature engineering + model selection
  • Maybe some data cleaning and integration
Example: Troo.ly
2005: TRANSACTIONS
2015: EXPERIENCES
Need for online trust has grown dramatically!
Would you rent your house to this stranger?
Can you trust the ground truth?
• Bad users might have a good label if they haven’t engaged in bad activity yet
• Labels may be incorrect if they are coming from bad internal models
• Labels may be incorrect because of wrong attributions in bad transactions
Rocketship.vc: company data
• How to trade off data sources based on Coverage, Accuracy, Depth, Freshness, and Cost?
• Which subset of data sources yields the best model?
• Which subset of data sources will identify promising companies most quickly?
• Promising start
  • Dong et al, VLDB 2012
  • Rekatsinas et al, SIGMOD 2014
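The subset-selection question can be made concrete with a simple greedy heuristic: repeatedly add the affordable source with the best marginal coverage per unit cost. This is a sketch under strong assumptions, not the method of the cited papers: it scores sources on coverage and cost only (ignoring accuracy, depth, and freshness), and the source names, costs, and company ids are hypothetical.

```python
def greedy_select(sources, budget):
    """Pick data sources greedily by new coverage per unit cost.

    sources: dict mapping name -> (cost, set of company ids covered);
             costs must be positive.
    budget:  total cost allowed.
    Returns (chosen names in pick order, covered ids, total spent).
    """
    chosen, covered, spent = [], set(), 0.0
    remaining = dict(sources)
    while remaining:
        best_name, best_ratio = None, 0.0
        for name, (cost, items) in remaining.items():
            if spent + cost > budget:
                continue  # cannot afford this source
            ratio = len(items - covered) / cost  # marginal gain per cost
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable adds new coverage
        cost, items = remaining.pop(best_name)
        chosen.append(best_name)
        covered |= items
        spent += cost
    return chosen, covered, spent


sources = {
    "web_crawl": (1.0, {1, 2, 3, 4}),
    "gov_registry": (2.0, {3, 4, 5, 6}),
    "premium_feed": (5.0, {1, 2, 3, 4, 5, 6, 7}),
}
print(greedy_select(sources, budget=4.0))
```

Note how the overlap matters: once the crawl is in, the registry is worth only the two companies it adds, which is the kind of diminishing return that makes "which subset?" a real question.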
Algorithmic Law Enforcement
The Economist, August 20, 2016
But what about perpetuating bias against minorities?
Summary
• Cannot trust the given data completely
• Ground truth is often neither true nor grounded
  • Data may have bias
• Look for additional data that can improve model
  • Quality/cost tradeoff?
• Generate your own training data!
  • E.g., Polarr photo-editing app
  • Data Programming (Ratner et al, 2016)
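To give a flavor of generating training labels programmatically, the sketch below writes a few noisy labeling functions and combines their votes. Two caveats: data programming as described by Ratner et al. learns how much to trust each labeling function, whereas this example substitutes a plain majority vote; and the user fields and rules are hypothetical, loosely inspired by the online-trust setting above.

```python
ABSTAIN = None  # a labeling function may decline to vote


def lf_verified_id(user):
    return 1 if user.get("verified_id") else ABSTAIN


def lf_many_chargebacks(user):
    return 0 if user.get("chargebacks", 0) >= 2 else ABSTAIN


def lf_long_history(user):
    return 1 if user.get("completed_stays", 0) >= 10 else ABSTAIN


def majority_label(user, labeling_functions):
    """Return 1 (trustworthy), 0 (not), or ABSTAIN on a tie or no votes."""
    votes = [v for v in (lf(user) for lf in labeling_functions)
             if v is not ABSTAIN]
    ones = sum(votes)
    zeros = len(votes) - ones
    if ones == zeros:
        return ABSTAIN
    return 1 if ones > zeros else 0


LFS = [lf_verified_id, lf_many_chargebacks, lf_long_history]
print(majority_label({"verified_id": True, "completed_stays": 12}, LFS))
```

The resulting labels are noisy by design; the point is that many cheap, imperfect rules can stand in for ground truth that is missing or untrustworthy.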
Summary
• 5 generations of data-driven applications
• Lessons and Opportunities
  1. The Age of the Intelligent App
  2. Disruption vs Optimization
  3. Human-Machine Collaboration
  4. Rise of the Cyborg
  5. The Data is not a Given
Identity Crisis?
Data Management
Semantic Web
Machine Learning
Data Mining
Information Retrieval
AI
Systems
Panel at NorCal DB Day, 2016
Data impacts every human endeavor
Data
Entertainment
Transportation
Government
Manufacturing
Sciences
Education
Security
Commerce
Data + X
• Core identity of the field is to create value from data
  • Never a better time for it!
• Data is now a key part of every field of human endeavor
  • Stanford CS+X
• The value of being an outsider
Go Forth And Disrupt!
Entertainment
Transportation
Government
Manufacturing
Sciences
Education
Security
Commerce
IIT Madras CS Visiting Chair Program
• Focus area: data-driven approaches to tackle important problems
• Leading faculty/researchers from around the world welcome!
• Flexible time commitment
  • Minimum 2 weeks
• Endowed by Venky Harinarayan and Anand Rajaraman
Confirmed Visiting Chairs so far…
• Jeff Ullman, Professor Emeritus, CS, Stanford
• Randy Katz, Distinguished Professor, EECS, UC Berkeley
• Hari Balakrishnan, Professor, EECS, MIT