
Post on 24-May-2020



The Liip Data Science Stack

Insights from building and maintaining it

Zürich, 08.05.2018


About me - Quick facts

Dr. Thomas Ebermann

- Diploma in Computer Science at the Univ. of Mannheim & Waterloo.

- PhD in Computational Social Science predicting information flow in Twitter.

- Working for Liip as Data Scientist since 2016.

- Love Ruby and Python.

LIIP PRINCIPLES

- Purpose over profits
- Trust over control
- Practice over theory
- Risk over safety
- Flexibility over strength
- Open over closed
- Compasses over maps

The Data Science Stack

History

– All GitHub stars
– All my bookmarks (mobile and Mac)
– Email / newsletters
– Internal company Slack

– Collect all the data science tools that I use on a regular basis or that have emerged on my horizon.

– Finally sort the mess in my head.

The Stack Idea

We use stacks in web dev in various areas, where we describe systems that build on top of each other and work well together:

LAMP Stack (Linux, Apache, MySQL, PHP)

Why not have a data science stack of tools that work well together?

Instead of pointing to only one tool, let's point to whole families of similar tools.

The Data Science Stack

- Data Sources: Where does the data come from?

- Data Processing: How can we clean and transform it?

- Database: How can we efficiently store/retrieve/search it?

- Analysis: How can we analyse it?

- Visualisation: How can we visualise it?

- Business Intelligence: Are there solutions that can do all in one?

But wait, what about the Gartner reports?

- Very high level

- Only big players

- Very few open source solutions

- No small tools

- Have to sell your soul to get into these magic quadrants

2017 Version


The 2017 PDF Poster

- 250 tools in one poster

- Provide orientation like a map

- Discover your white spots on the map

- Over 30’000 visitors

- Over 4’300 downloads worldwide

- Over 300 mail signups to be notified for Version 2

Quite a success but it was out of date the day we created it!


Insights


Insights Data Sources

- Scrapers (7): Lots of tools and variety, very open source friendly (PhantomJS+Capybara)

- Website Analytics (37): There are surprisingly many more tools out there than Google Analytics. (Google Analytics)

- Tag Management (6): A lot of competition has emerged since Google Tag Manager (Google Tag Manager)

- Heatmaps (5): Controversial but insightful (Hotjar)

- Mobile Analytics (18): A lot of specialized tools (Google Analytics)

- Social Media (12): Due to exclusive contracts and harmonization/acquisition there are only a few big cross-platform data providers out there (Brandwatch)

- IoT (8): Plays only a marginal role for us as a data source right now (Ubidots)

Insights Data Processing

- ETL (10): Tools for very large scale or data lakes (Talend)

- Data Cleaning (3): User-friendly tools exist that target more than just data scientists (Trifacta)

- Alerting & Logging (7): Excellent production-ready open source solutions are changing the way logs are consumed these days (Graylog)

- Message Queues (20): Pub/sub (Kafka); real-time processing on the fly is the new paradigm (Flink); the Apache Foundation is very active here

Insights Databases

- Databases (43): There is much more than MySQL vs. NoSQL: graph databases (Neo4j), time series databases (TimescaleDB), key-value stores (Redis), column-oriented databases (Vertica, VoltDB, Exasol)

- Search (20): A lot of good alternatives to Solr exist nowadays (Elastic) and SaaS is very popular (Algolia)

- Hadoop Ecosystem (13): The whole zoo of tools is well integrated yet remains complex (Spark)

Insights Analysis

- Deep Learning (21): Huge momentum; lots of different frameworks and applications are popping up (TensorFlow/Keras)

- Statistical software packages (11): The old monoliths are slowly being surpassed by open source solutions (R, RapidMiner, Orange)

- General ML libraries (24): A myriad of choices for every programming language, yet Python subjectively remains the most active one (scikit-learn)

- Computer Vision (9): All of the big five offer SaaS solutions, but open source is strong (OpenCV)

- NLP/Speech recognition (23): Same here (Wit.ai)

- Assistants/Chatbots (15): A lot of promising solutions and frameworks quickly emerged (Chatfuel)

Insights Visualization

- General Visualisation (32): Huge number of tools; stable candidates exist for Python (seaborn) and R (Shiny)

- JS visualisation (28): JS libs are popping up every week (D3) :)

- Dashboards (17): The line between BI and dashboards is blurring; not too many open source solutions are available (Plotly)

Business Intelligence

- Business Intelligence (46): I thought I knew a couple of alternatives, but the options are vast and highly competitive. Most solutions are commercial, but good open source solutions are available (Kibana, Tableau). Ask Gartner :)

- BI on Hadoop (5): Hard to see where the solutions begin and the architecture ends (Datameer)

- Data Science Platforms (23): The new BI. A combination of the freedom of IPython notebooks and solid infrastructure (DataRobot). Automated ML.
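
The category counts given in parentheses on the Insights slides can be tallied per layer. The following is a minimal Python sketch and not part of the original talk: the dictionary layout and the function names `layer_totals` and `largest_category` are assumptions made for illustration, and the sums assume that no tool is counted in more than one category.

```python
# Category counts as printed in parentheses on the "Insights" slides.
STACK_COUNTS = {
    "Data Sources": {
        "Scrapers": 7, "Website Analytics": 37, "Tag Management": 6,
        "Heatmaps": 5, "Mobile Analytics": 18, "Social Media": 12, "IoT": 8,
    },
    "Data Processing": {
        "ETL": 10, "Data Cleaning": 3, "Alerting & Logging": 7,
        "Message Queues": 20,
    },
    "Databases": {"Databases": 43, "Search": 20, "Hadoop Ecosystem": 13},
    "Analysis": {
        "Deep Learning": 21, "Statistical software packages": 11,
        "General ML libraries": 24, "Computer Vision": 9,
        "NLP/Speech recognition": 23, "Assistants/Chatbots": 15,
    },
    "Visualization": {
        "General Visualisation": 32, "JS visualisation": 28, "Dashboards": 17,
    },
    "Business Intelligence": {
        "Business Intelligence": 46, "BI on Hadoop": 5,
        "Data Science Platforms": 23,
    },
}


def layer_totals(counts):
    """Sum the category counts of each layer."""
    return {layer: sum(cats.values()) for layer, cats in counts.items()}


def largest_category(counts):
    """Return (layer, category, count) for the category with the most tools."""
    flat = (
        (layer, cat, n)
        for layer, cats in counts.items()
        for cat, n in cats.items()
    )
    return max(flat, key=lambda item: item[2])


if __name__ == "__main__":
    for layer, total in layer_totals(STACK_COUNTS).items():
        print(f"{layer}: {total}")
    print("Largest category:", largest_category(STACK_COUNTS))
```

Keeping the counts in a nested dictionary mirrors the layer/category structure of the slides, so a new category only needs one extra entry.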


2018 Version


From PDF to Website


http://datasciencestack.liip.ch

Features I

- You can add tools too!

- Search

Features II

- Internal Liip technology db integration (Zebra)

- Quarterly mailing list (keep busy decision makers up to date)

- JSON Export
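
The slides do not show what the JSON export looks like. As a rough illustration only, one record per tool could carry its name, layer and category. The `Tool` fields and the helpers `export_json` and `tools_in_layer` below are hypothetical and do not describe the actual export of the website; the three sample tools are taken from the Insights slides.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Tool:
    """Hypothetical record layout; the real export schema is not shown in the talk."""
    name: str
    layer: str
    category: str


# Sample entries taken from the Insights slides.
TOOLS = [
    Tool("Kafka", "Data Processing", "Message Queues"),
    Tool("Keras", "Analysis", "Deep Learning"),
    Tool("Neo4j", "Databases", "Databases"),
]


def export_json(tools):
    """Serialise the tool records to a JSON array string."""
    return json.dumps([asdict(tool) for tool in tools], indent=2)


def tools_in_layer(exported, layer):
    """Read an exported JSON string back and list the tool names in one layer."""
    return [t["name"] for t in json.loads(exported) if t["layer"] == layer]


if __name__ == "__main__":
    exported = export_json(TOOLS)
    print(exported)
    print(tools_in_layer(exported, "Analysis"))
```

A flat list of records like this is easy to filter by layer or category in any language, which is one plausible reason to offer an export at all.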


Insights 2018


Outlook


What's next?

Assessment of Tools

- Adopt: We feel strongly that the industry should be adopting these items. We use them when appropriate on our projects.

- Trial: Worth pursuing. It is important to understand how to build up this capability. Enterprises should try this technology on a project that can handle the risk.

- Assess: Worth exploring with the goal of understanding how it will affect your enterprise.

- Hold: Proceed with caution.
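
The four assessment levels form an ordered scale, which could be modelled roughly as follows. This is a minimal sketch under assumptions: the `Ring` enum, the `group_by_ring` helper and the placeholder tool names are hypothetical, and no real tool is rated here.

```python
from collections import defaultdict
from enum import IntEnum


class Ring(IntEnum):
    """Assessment levels from the slide, ordered from most to least recommended."""
    ADOPT = 1
    TRIAL = 2
    ASSESS = 3
    HOLD = 4


def group_by_ring(assessments):
    """Group tool names by ring; rings appear in ADOPT..HOLD order, empty rings are skipped."""
    grouped = defaultdict(list)
    for tool, ring in assessments.items():
        grouped[ring].append(tool)
    return {ring: sorted(grouped[ring]) for ring in Ring if ring in grouped}


if __name__ == "__main__":
    # Placeholder names only; these are not real assessments.
    example = {
        "tool-b": Ring.TRIAL,
        "tool-a": Ring.ADOPT,
        "tool-c": Ring.ADOPT,
        "tool-d": Ring.HOLD,
    }
    for ring, tools in group_by_ring(example).items():
        print(f"{ring.name.title()}: {', '.join(tools)}")
```

Using an ordered enum keeps comparisons such as `Ring.ADOPT < Ring.HOLD` meaningful, so tools can be sorted by how strongly they are recommended.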


Solid rucksack

Data Sources: Google Analytics

Processing: Trifacta

Analysis: Scikit-Learn

Visualization: Highcharts (JS), Shiny (R), seaborn (Python)

Business Intelligence: KNIME


Trendy rucksack

Sources: Chartbeat or Snowplow

Processing: Fluentd

Analysis: Keras

Visualization: Plotly

Business Intelligence: DataRobot or Dataiku

152 employees, 5 locations, 1 vision


St. Gallen

Zürich

Bern

Fribourg

Lausanne

Data Services @ Liip


Virtual Assistants

• Chatbots and Assistants

Data Solutions

• Recommender Systems

• Computer Vision

• Speech Recognition

• Integrated ML Models

• Whole Web-apps / apps

Data Science / Consulting

• From Data to Insights

• Data Analysis

• Network Analysis (SNA)

• Social Graph

• Time Series

• Machine Learning

Data Visualization

• Data Visualization

• Geo Visualization

• Data-Modeling

• Real Time Dashboarding

Big Data

• Storage (Hadoop)

• And Analysis (Spark)

• Data Streams (Kafka)

Open Data

• Infrastructure (CKAN)

• Linked Data

Data-Driven User Experiences

• Data Interfaces

• Conversational Design

Mobile AI

• CoreML

Thank you!

Excited to hear your questions.

Dr. Thomas Ebermann

Data Scientist

thomas.ebermann@liip.ch
