Staying Ahead of the Data Avalanche
Challenges and Opportunities in Analytics
Prof. Dr. Seppe vanden Broucke
SAS Analytics Experience Rome – 8 November 2016
Presenter: Seppe vanden Broucke
• Assistant professor in Data and Process Science at the Department of Decision Sciences and Information Management at KU Leuven (Belgium)
• PhD in Applied Economics at KU Leuven, Belgium in 2014
• Title: Advances in Process Mining: Artificial Negative Events and Other Techniques
• Research: business data mining and analytics, machine learning, process management, process mining
• Contact: www.dataminingapps.com [email protected]
BIG DATA
“We live in a data flooded world”
“Making sense of mountains of data” aka
“Scale your data mountain”
“The data avalanche”
“Data is the new oil”
“The data tsunami”
BIG DATA
“It all sounds kind of dangerous”
But so many success stories…
BIG DATA + DATA SCIENCE & ANALYTICS =
“We live in magical times”
Uber
Contextual RNN-GANs for Abstract Reasoning Diagram Generation
Arnab Ghosh*, Viveka Kulharia*, Amitabha Mukerjee, Vinay Namboodiri, Mohit Bansal
Measuring an Artificial Intelligence System's Performance on a Verbal IQ Test For Young Children
Stellan Ohlsson, Robert H. Sloan, György Turán, Aaron Urasky
BIG DATA + DATA ANALYTICS
“Let the good times roll”
So why do so many projects fail?
“During 2015, only 15% of Fortune 500 organizations were able to exploit big data for competitive advantage” – Gartner
“Data maturity of companies is very disparate, and the most advanced of them start doubting.” – Christophe Bourguignat
“75% have invested in Big Data, but only 10% have projects in production.”
Companies face disillusions. They start asking questions: I know how much it costs, but how much do I earn? What is my return on investment?
Machine learning and data science have (just) reached “peak hype”
The challenges ahead
• TALENT
• PROCESS
• TOOLS, FILES, FEEDS
• COMMUNICATION
• MEASURING
• PRIVACY, COMPLIANCE
• ETHICS
• QUALITY
TALENT
“A data scientist is like a gold-coloured unicorn: mythical powers, but impossible to find”
Programmer
TALENT
Or a spider with 25 legs?
PROCESS
Data science as a straight-through process?
Adhering to a data science workflow is A-OK:
• CRISP-DM
• The KDD process
• SEMMA
• BinaryEdge
PROCESS
Data science as a straight-through process?
Data → Selection → Selected Data → Cleaning → Cleaned/Processed Data → Transformation → Transformed Data → Discovery → Mined Model/Patterns → Interpretation/Evaluation → Knowledge/Insights
PROCESS
Not really...
PROCESS
More like a loop
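A minimal sketch of what "more like a loop" can mean in practice, assuming hypothetical stage functions named after the stages on the earlier slide (Selection, Cleaning, Transformation, Discovery, Interpretation/Evaluation). The toy records, the mean used as a stand-in "model", and the acceptance tolerance are illustrative assumptions, not part of the talk.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing value.
RAW = [
    {"id": 1, "age": 34, "spend": 120.0},
    {"id": 2, "age": None, "spend": 80.0},
    {"id": 3, "age": 51, "spend": 3000.0},
    {"id": 4, "age": 29, "spend": 95.0},
]

def select(data):
    # Selection: keep only the fields the analysis needs.
    return [{"age": row["age"], "spend": row["spend"]} for row in data]

def clean(data, max_spend):
    # Cleaning: drop rows with a missing age or an implausibly high spend.
    return [r for r in data if r["age"] is not None and r["spend"] <= max_spend]

def transform(data):
    # Transformation: reduce each row to the single feature being modelled.
    return [r["spend"] for r in data]

def discover(values):
    # Discovery: a stand-in "model" (just the mean of the feature).
    return mean(values)

def evaluate(model, values, tolerance):
    # Interpretation/Evaluation: accept only if every point is close to the model.
    return all(abs(v - model) <= tolerance for v in values)

def run_loop(raw, tolerance=50.0, max_rounds=5):
    """Repeat the stages, tightening the cleaning rule after each failed evaluation."""
    max_spend = float("inf")
    for round_no in range(1, max_rounds + 1):
        values = transform(clean(select(raw), max_spend))
        model = discover(values)
        if evaluate(model, values, tolerance):
            return model, round_no
        # Evaluation failed: go back to cleaning and exclude the largest value.
        max_spend = max(values) - 1
    return None, max_rounds

result = run_loop(RAW)
```

With this toy data the first pass fails evaluation because of one extreme value, and the second pass succeeds after the cleaning rule is tightened. A straight-through run would have stopped after the first pass.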
PROCESS
Experiments can take a while…
PROCESS
These things are hard
• How to create a sense of urgency?
• What does it mean to be finished?
• You can’t predict the future.
COMMUNICATION
Throw-it-over-the-wall projects
• I want to put this GBM into production, though some steps are done using R and SAS
• Anyone know what this XGBoost thing is?
• Why aren’t we deployed yet?
• We have all this data, why can’t we find interesting customers?
COMMUNICATION
Talking helps
• Learn each other’s language
• Think with your business hat on
• Teach semantics (why a shorter lead list is not easier to produce)
• Convert hard problems into simpler ones
• Use examples, metaphors, analogies
• Show them and show them often
• IT and data science can live together
MEASURING
“Not everything that counts can be counted… and not everything that can be counted counts”
• Show before and after
• “When are you happy?”
• Accept failures
• Manual measuring can be a good thing
• Hard to automate subjective feelings…
TOOLS, FILES, FEEDS
“No one ever got fired for installing Hadoop on a cluster… right?”
TOOLS, FILES, FEEDS
A fool can ask more questions in an hour than a wise man can answer in a hundred years
A data scientist can find, love, and ditch more tools/libraries/… in an hour than a procurement officer can vet in a hundred years
• Focus on the files
• What are we going to use it for?
TOOLS, FILES, FEEDS
Focus on feeds, files, data
• Let them (us) own the data
• Ship fast, ship often
• Focus on format and storage standards, not on technology:
“Can I get information on X for months A and B with only those columns that changed?”
... “Can I get it myself?”
• Where’s your golden data set?
• Trust your experts
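The question quoted on this slide ("…for months A and B with only those columns that changed?") becomes cheap to answer once monthly extracts share a stable format. A minimal sketch, assuming hypothetical monthly snapshots held as dictionaries keyed by a record id; the function name, field names, and values are invented for illustration.

```python
def changed_columns(snapshot_a, snapshot_b, key):
    """Return only the fields of record `key` whose values differ between two snapshots.

    Each changed field maps to a (value_in_a, value_in_b) pair; a field that is
    missing from one snapshot shows up with None on that side.
    """
    before = snapshot_a.get(key, {})
    after = snapshot_b.get(key, {})
    columns = sorted(set(before) | set(after))
    return {
        col: (before.get(col), after.get(col))
        for col in columns
        if before.get(col) != after.get(col)
    }

# Hypothetical monthly extracts for customer "X".
month_a = {"X": {"segment": "retail", "balance": 1200, "city": "Leuven"}}
month_b = {"X": {"segment": "retail", "balance": 1750, "city": "Rome"}}

diff = changed_columns(month_a, month_b, "X")
print(diff)
```

Because the function depends only on the shape of the extracts, it keeps working whatever storage technology sits underneath, which is the point of standardising on format first.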
TOOLS, FILES, FEEDS
Technology moves too fast anyway…
• HDFS?
• What about HDF5, or Kudu?
• Do we even have unstructured data?
• Do we know what to do with it?
• V’s of Big Data – yeah right!
• BigSQL, or Hive, or Slurp?
• Cloudera, Hortonworks, Teradata, Oracle, I want Hadoop!?
• What do you mean we need H2O on top of Spark on top of Hadoop? We just installed X
• We did these things before… they weren’t hard then
• True, but…
TOOLS, FILES, FEEDS
It’s a difficult balance
TOOLS, FILES, FEEDS
The wall of deployment
• Versioning
• Collaboration
• Scalable execution
• Multiple language support
• Multiple kernel support
• Monitoring
• Scheduling
• Acyclic dependency graphs
• Quite different from playing in a notebook
• Vendors are starting to help out
• SAS, SPSS, Domino Data Lab, sense.io, ScienceOps <-> Jupyter, Rodeo, your 3 GB of pip packages
• Familiar to neither most data scientists (too messy) nor IT shops (too unfamiliar)
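Two items on this list, scheduling and acyclic dependency graphs, come down to running steps in an order that respects what each step depends on. A minimal sketch using `graphlib` from the Python standard library (Python 3.9+); the step names and the `execution_order` helper are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
steps = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "train": {"features"},
    "score": {"train", "features"},
}

def execution_order(graph):
    """Return one valid run order, or raise ValueError if the graph contains a cycle."""
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The detected cycle is the second element of the exception's args.
        raise ValueError(f"dependency cycle: {exc.args[1]}") from exc

order = execution_order(steps)
print(order)
```

A notebook runs cells in whatever order the analyst clicks them. Writing the dependencies down makes the order explicit and lets a scheduler reject a pipeline that can never finish.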
TOOLS, FILES, FEEDS
The “Joel Test” for Data Science
• Can new hires get set up in the environment to run analyses on their first day?
• Can data scientists utilize the latest tools/packages without help from IT?
• Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?
• Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?
• Does collaboration happen through a system other than email or copying files?
• Can predictive models be deployed to production without custom engineering or infrastructure work?
• Is there a single place to search for past research and reusable data sets, code, etc?
• Do your data scientists use the best tools money can buy?
Source: https://blog.dominodatalab.com/joel-test-data-science/
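One question on this checklist asks whether past experiments can be reproduced from the original code, data, parameters, and software versions. A minimal sketch of recording those four things next to a result, using only the Python standard library; the function name, the manifest fields, and the sample inputs are hypothetical.

```python
import hashlib
import json
import platform
import sys

def experiment_manifest(code_text, data_bytes, params):
    """Fingerprint the inputs needed to rerun an experiment later."""
    return {
        "code_sha256": hashlib.sha256(code_text.encode("utf-8")).hexdigest(),
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": params,
        "python_version": platform.python_version(),
        "platform": sys.platform,
    }

manifest = experiment_manifest(
    code_text="model = fit(X, y, depth=3)",  # hypothetical training script
    data_bytes=b"id,label\n1,0\n2,1\n",      # hypothetical data set contents
    params={"depth": 3, "seed": 42},
)
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Storing such a manifest with every result means a later reader can at least tell whether the code, data, or environment changed. It does not capture package versions, which a real setup would also need to record.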
QUALITY
Garbage in…
“This model is gonna be great!”
QUALITY
Sometimes they are…
• Really: everyone has bad data
• But: more “bad” means more time
• Do make sure to get a continuous source to the “bad” data
• Survey: 50+ banks participating world-wide
• Most banks indicated that between 10–20 percent of their data suffer from data quality problems
• Manual data entry is one of the key problems
• Diversity of data sources and consistent corporate-wide data representation are the main challenges for data quality
• Regulatory compliance is the key motive to improve data quality
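A figure such as "10–20 percent of data suffer from quality problems" presupposes some rule for what counts as a bad record. A minimal sketch of such a measurement, assuming hypothetical records and three illustrative checks (missing value, impossible value, inconsistent representation). None of the records, checks, or thresholds come from the survey.

```python
# Hypothetical records, e.g. entered by hand into different source systems.
records = [
    {"id": 1, "country": "BE", "age": 42},
    {"id": 2, "country": "Belgium", "age": 37},  # inconsistent representation
    {"id": 3, "country": "IT", "age": None},     # missing value
    {"id": 4, "country": "IT", "age": 29},
    {"id": 5, "country": "BE", "age": 210},      # impossible value
]

VALID_COUNTRIES = {"BE", "IT"}

def quality_issues(record):
    """Return the list of checks that a record fails (empty if it passes all)."""
    issues = []
    if record["age"] is None:
        issues.append("missing age")
    elif not 0 <= record["age"] <= 120:
        issues.append("age out of range")
    if record["country"] not in VALID_COUNTRIES:
        issues.append("unknown country code")
    return issues

# Map each failing record id to the checks it failed.
bad = {r["id"]: quality_issues(r) for r in records if quality_issues(r)}
bad_share = len(bad) / len(records)
print(f"{bad_share:.0%} of records fail at least one check: {bad}")
```

Keeping the failing records and their reasons, rather than silently dropping them, is what makes a continuous view of the "bad" data possible.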
PRIVACY, COMPLIANCE
Oh boy…
• Datensparsamkeit (data minimisation)
• Cookie law
• Basel II / III
• Who knows where the cloud is anyways?
• EU directives outdated
• “It’s all on Facebook anyway”
PRIVACY, COMPLIANCE
Academics are just getting started…
PRIVACY, COMPLIANCE
In more ways than one...
PRIVACY, COMPLIANCE
“If only we didn’t have to worry about this”
PRIVACY, COMPLIANCE
Use it as a competitive advantage?
https://backchannel.com/an-exclusive-look-at-how-ai-and-machine-learning-work-at-apple-8dbfb131932b#.crky6nt6k
ETHICS
Data science for good?
• Can an algorithm be racist? Sexist?
• “Will Predictive Models Outliers Be The New Socially Excluded?”
• Companies like DataKind, or Bayes Impact
• Concept of open models
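One way to start on the question "can an algorithm be racist? sexist?" is to compare how often a model selects people from different groups. A minimal sketch of that comparison; the groups, the decisions, and the choice of a lowest-to-highest rate ratio are illustrative assumptions and not something the talk prescribes. A single ratio cannot settle whether a model is fair.

```python
from collections import defaultdict

# Hypothetical model decisions as (group, approved?) pairs.
decisions = [
    ("A", True), ("A", True), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]

def selection_rates(rows):
    """Share of positive decisions per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, approved in rows:
        totals[group] += 1
        positives[group] += int(approved)
    return {group: positives[group] / totals[group] for group in totals}

rates = selection_rates(decisions)
# Ratio of the lowest to the highest selection rate (1.0 means equal rates).
ratio = min(rates.values()) / max(rates.values())
print(rates, round(ratio, 2))
```

An open model, in the sense of the last bullet, is one where outsiders can run this kind of check themselves.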
The challenges today
• TALENT
• PROCESS
• TOOLS, FILES, FEEDS
• COMMUNICATION
• MEASURING
• PRIVACY, COMPLIANCE
• ETHICS
• QUALITY
Thank you