The Liip Data Science Stack
Insights from building and maintaining it
Zürich, 08.05.2018
About me - Quick facts
Dr. Thomas Ebermann
- Diploma in Computer Science at the Univ. of Mannheim & Waterloo.
- PhD in Computational Social Science, predicting information flow on Twitter.
- Working for Liip as Data Scientist since 2016.
- Love Ruby and Python.
LIIP PRINCIPLES
- Purpose over profits
- Trust over control
- Practice over theory
- Risk over safety
- Flexibility over strength
- Open over closed
- Compasses over maps
The Data Science Stack
History
– All GitHub stars
– All my bookmarks, mobile and Mac
– Email / newsletters
– Internal company Slack
– Collect all the data science tools that I use on a regular basis or that have emerged on my horizon.
– Finally sort the mess in my head.
The Stack Idea
We use stacks in web dev in various areas, where we describe systems that build on top of each other and work well together:
LAMP Stack (Linux, Apache, MySQL, PHP)
Why not have a data science stack of tools that work well together?
Instead of pointing to only one tool, let's point to whole families of similar tools.
The Data Science Stack
Where does the data come from?
Data Sources
How can we analyse it?
Analysis
Are there solutions that can do all in one?
Business Intelligence
How can we clean and transform it?
Data Processing
How can we efficiently store/retrieve/search it?
Database
How can we visualise it?
Visualisation
But wait, what about the Gartner reports?
- Very high level
- Only big players
- Very few open source solutions
- No small tools
- Have to sell your soul to get into these magic quadrants
2017 Version
The 2017 PDF Poster
- 250 tools in one poster
- Provide orientation like a map
- Discover your white spots on the map
- Over 30’000 visitors
- Over 4’300 downloads worldwide
- Over 300 mail signups to be notified for Version 2
Quite a success, but it was out of date the day we created it!
Insights
Insights Data Sources
- Scrapers (7): Lots of tools and variety, very open source friendly (PhantomJS+Capybara)
- Website Analytics (37): There are surprisingly many more tools out there than Google Analytics. (Google Analytics)
- Tag Management (6): A lot of competition has emerged since Google Tag Manager (Google Tag Manager)
- Heatmaps (5): Controversial but insightful (Hotjar)
- Mobile Analytics (18): A lot of specialized tools (Google Analytics)
- Social Media (12): Due to exclusive contracts and harmonization/acquisitions, there are only a few big cross-platform data providers out there (Brandwatch)
- IoT (8): Marginal role for us as a data source right now (Ubidots)
Insights Data Processing
- ETL (10): Tools for very large scale or data lakes (Talend)
- Data Cleaning (3): User-friendly tools exist that target more than just the data scientist (Trifacta)
- Alerting & Logging (7): Excellent production-ready open source solutions are changing the way logs are consumed these days (Graylog)
- Message Queues (20): Pub/sub (Kafka); real-time processing on the fly is the new paradigm (Flink). The Apache Foundation is very active here
Insights Databases
- Databases (43): There is much more than MySQL vs. NoSQL: graph databases (Neo4j), time series databases (TimescaleDB), key-value stores (Redis), column-oriented databases (Vertica, VoltDB, Exasol)
- Search (20): A lot of good alternatives to Solr exist nowadays (Elastic), and SaaS is very popular (Algolia)
- Hadoop Ecosystem (13): The whole zoo of tools is well integrated by now, yet remains complex (Spark)
Insights Analysis
- Deep Learning (21): Huge momentum; lots of different frameworks and applications are popping up (TensorFlow/Keras)
- Statistical software packages (11): The old monoliths are slowly being surpassed by open source solutions (R, RapidMiner, Orange)
- General ML libraries (24): A myriad of choices for every programming language, yet Python remains, subjectively, the most active one (scikit-learn)
- Computer Vision (9): All of the big five offer SaaS solutions, but open source is strong (OpenCV)
- NLP/Speech recognition (23): Same here (Wit.ai)
- Assistants/Chatbots (15): A lot of promising solutions and frameworks quickly emerged (Chatfuel)
Insights Visualization
- General Visualisation (32): Huge number of tools; stable candidates exist for Python (seaborn) and R (Shiny)
- JS visualisation (28): JS libs are popping up every week (D3) :)
- Dashboards (17): The line between BI and dashboards is blurring; not too many open source solutions are available (Plotly)
Business Intelligence
- Business Intelligence (46): I thought I knew a couple of alternatives, but the options are vast and highly competitive. Most solutions are commercial, but good open source solutions are available (Kibana, Tableau). Ask Gartner :)
- BI on Hadoop (5): Hard to see where the solutions begin and the architecture ends (Datameer)
- Data Science Platforms (23): The new BI. A combination of the freedom of IPython notebooks and solid infrastructure (DataRobot). Automated ML.
2018 Version
From PDF to Website
http://datasciencestack.liip.ch
Features I
- You can add tools too!
- Search
Features II
- Internal Liip technology DB integration (Zebra)
- Quarterly mailing list (keep busy decision makers up to date)
- JSON Export
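The slides do not show what the export looks like, so the following is only a sketch of how such an export might be consumed. The field names `name` and `category`, the sample records, and the helper `group_by_category` are assumptions for illustration; the tool and category names themselves are taken from the Insights slides.

```python
import json
from collections import defaultdict

# Hypothetical export payload: the real schema of the JSON export is not
# shown in the slides, so "name" and "category" are assumed field names.
SAMPLE_EXPORT = """
[
  {"name": "Kafka", "category": "Message Queues"},
  {"name": "Flink", "category": "Message Queues"},
  {"name": "Graylog", "category": "Alerting & Logging"},
  {"name": "Trifacta", "category": "Data Cleaning"}
]
"""


def group_by_category(raw_json):
    """Parse an export string and group tool names by category."""
    grouped = defaultdict(list)
    for tool in json.loads(raw_json):
        grouped[tool["category"]].append(tool["name"])
    # Sort the names inside each category for stable output.
    return {category: sorted(names) for category, names in grouped.items()}


tools_by_category = group_by_category(SAMPLE_EXPORT)
for category, names in sorted(tools_by_category.items()):
    print(f"{category}: {', '.join(names)}")
```

Grouping by category mirrors the way the poster organises tools into families rather than singling out individual products.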
Insights 2018
Outlook
What's next?
Assessment of Tools
- Adopt: We feel strongly that the industry should be adopting these items. We use them when appropriate on our projects.
- Trial: Worth pursuing. It is important to understand how to build up this capability. Enterprises should try this technology on a project that can handle the risk.
- Assess: Worth exploring with the goal of understanding how it will affect your enterprise.
- Hold: Proceed with caution.
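The four rings lend themselves to an ordered enumeration. The sketch below is one possible encoding, not something shown in the talk: the `Ring` type, the `sort_by_ring` helper and the placeholder tool names are hypothetical, and no actual assessment of any tool is implied.

```python
from enum import IntEnum


class Ring(IntEnum):
    """Assessment rings from the slide, ordered from strongest to weakest endorsement."""
    ADOPT = 1
    TRIAL = 2
    ASSESS = 3
    HOLD = 4


def sort_by_ring(assessments):
    """Return (tool, ring) pairs sorted by ring, then alphabetically by tool name."""
    return sorted(assessments.items(), key=lambda item: (item[1], item[0]))


# Placeholder entries only: the slides do not assign rings to specific tools.
example = {
    "tool-c": Ring.HOLD,
    "tool-a": Ring.ADOPT,
    "tool-d": Ring.TRIAL,
    "tool-b": Ring.ADOPT,
}

for tool, ring in sort_by_ring(example):
    print(f"{ring.name.title():<7}{tool}")
```

Using an integer-backed enum keeps the rings comparable, so a tool list can be sorted or filtered by how strongly it is recommended.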
Solid rucksack
Data Sources: Google Analytics
Processing: Trifacta
Analysis: Scikit-Learn
Visualization: Highcharts (JS), Shiny (R), Seaborn (Python)
Business Intelligence: KNIME
Trendy rucksack
Sources: Chartbeat or Snowplow
Processing: Fluentd
Analysis: Keras
Visualization: Plotly
Business Intelligence: DataRobot or Dataiku
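The two rucksacks can be compared layer by layer. This is a minimal sketch with the picks copied from the two slides; the dictionary layout, the normalised layer names (the trendy slide says "Sources" where the solid one says "Data Sources") and the `compare` helper are illustrative assumptions.

```python
# Picks per layer, copied from the "Solid rucksack" and "Trendy rucksack" slides.
# Layer names are normalised here so that both rucksacks share the same keys.
SOLID = {
    "Data Sources": ["Google Analytics"],
    "Processing": ["Trifacta"],
    "Analysis": ["Scikit-Learn"],
    "Visualization": ["Highcharts", "Shiny", "Seaborn"],
    "Business Intelligence": ["KNIME"],
}
TRENDY = {
    "Data Sources": ["Chartbeat", "Snowplow"],
    "Processing": ["Fluentd"],
    "Analysis": ["Keras"],
    "Visualization": ["Plotly"],
    "Business Intelligence": ["DataRobot", "Dataiku"],
}


def compare(solid, trendy):
    """Yield (layer, solid picks, trendy picks) for every layer in either rucksack."""
    for layer in sorted(set(solid) | set(trendy)):
        yield layer, solid.get(layer, []), trendy.get(layer, [])


for layer, solid_picks, trendy_picks in compare(SOLID, TRENDY):
    print(f"{layer}: {', '.join(solid_picks)} | {', '.join(trendy_picks)}")
```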
152 employees, 5 locations, 1 vision
St. Gallen
Zürich
Bern
Fribourg
Lausanne
Data Services @ Liip
Virtual Assistants
• Chatbots and Assistants
Data Solutions
• Recommender Systems
• Computer Vision
• Speech Recognition
• Integrated ML Models
• Whole Web-apps / apps
Data Science / Consulting
• From Data to Insights
• Data Analysis
• Network Analysis (SNA)
• Social Graph
• Time Series
• Machine Learning
Data Visualization
• Data Visualization
• Geo Visualization
• Data-Modeling
• Real Time Dashboarding
Big Data
• Storage (Hadoop)
• And Analysis (Spark)
• Data Streams (Kafka)
Open Data
• Infrastructure (CKAN)
• Linked Data
Data-Driven User Experiences
• Data Interfaces
• Conversational Design
Mobile AI
• CoreML
Thank you!
Excited to hear your questions.
Dr. Thomas Ebermann
Data Scientist
Insights Data Sources
Scrapers (7), Website Analytics (37), Social Media (12), Tag Management (6), Mobile Analytics (18), Heatmaps (5), IoT (8)