Sympathy for data

Sympathy for Data Creating FOSS in an enterprise environment Stefan Larsson Combine AB E-mail: [email protected] Twitter: @lastsys FSCONS 2014 / 2014-11-01

description

Presentation of the tool Sympathy for Data at FSCONS 2014 in Gothenburg, Sweden.

Transcript of Sympathy for data

Page 1: Sympathy for data

Sympathy for Data

Creating FOSS in an enterprise environment

Stefan Larsson Combine AB

E-mail: [email protected] Twitter: @lastsys

FSCONS 2014 / 2014-11-01

Page 2: Sympathy for data

Outline

• Background and problem description

• Technology overview

• Demonstration

• Future and conclusion

Page 3: Sympathy for data

Background and Problem Description

Page 4: Sympathy for data

Spreading local innovation is difficult in a large organization

[Figure: organization chart. Management sits at the top, with Units 1 to 3 below. Unit 2 is split into Depts 2.1 to 2.3; Depts 2.1 and 2.2 are split into sections (2.1.1, 2.1.2, 2.2.1, 2.2.2), each section into two groups, with individual employees at the bottom.]

Page 5: Sympathy for data

In 2009 we started coding during evenings and weekends

Ensure ownership! or

Make an agreement with your employer first!

Page 6: Sympathy for data

We decided to ask our employer for funding through paid time

Selling Arguments

Company Lawyers

Maintenance Ensure Function

Ownership

Code Contribution

Warranty and Responsibility

Page 7: Sympathy for data

”Big Data” is a recent marketing gimmick; engineers have lived with it for decades

Issue | Details
Volume | Storage, memory and distribution.
Velocity | Rapid results from data and data generation rate.
Variety | Many different data sources and data structures.
Veracity | Truth or accuracy of data.

Page 8: Sympathy for data

Business Intelligence evolving into Data Science

[Figure: business value (low to high) plotted against time (past to future). Business Intelligence is retrospective; Data Science is forward thinking.]

Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013

Page 9: Sympathy for data

It is easy to get stuck in ”why”

[Figure: business value (low to high) plotted against analytics sophistication, moving from reporting through analysis to action. Questions shown along the curve:]

• What should I do next?

• What result should I expect?

• What if trends continue?

• Why did this happen?

• How did we do?

• How many, how often, where?

Redrawn from ”Big Data - Understanding how data powers big business” by Bill Schmarzo, Wiley, 2013

Page 10: Sympathy for data

”Data Science” can be much more complex than BI

Business Intelligence: Well Formed Data Source -> ETL -> Analyze -> Report

Data Science: Unstructured Data Sources (several) -> ELT -> Analysis / Modelling -> Report / Prediction -> Action

Page 11: Sympathy for data

Engineers are usually not software developers, but can have great scripting skills

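[Figure: Data 1, Data 2 and Data 3 pass through a chain of scripts that hand results to each other as files: data import script, clean and group data script, analyze data script, visualize / report result script, ending in conclusions / actions. The extract, load and transform steps are marked as 80-90% of the work.]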

Page 12: Sympathy for data

Those engineers who are uncomfortable with writing scripts tend to use Microsoft Excel for everything

[Figure: Data 1, Data 2 and Data 3 are brought into Excel by copy/paste, two of them marked ”No reader”. The result is produced by manual labor with mouse and keyboard.]

Page 13: Sympathy for data

With independent work the individual data formats are often incompatible

[Figure: Engineer 1, Engineer 2 and Engineer 3 each build their own chain for Data 1, Data 2 and Data 3: data import, clean and group data, analyze data, visualize / report result. Annotation: 80-90% of the work.]

Page 14: Sympathy for data

Well-defined data formats at the inputs and outputs of operations simplify reuse of scripts

[Figure: Data 1, Data 2 and Data 3 feed one shared chain: data import, clean and group data, analyze data, visualize / report result. Engineer 1, Engineer 2 and Engineer 3 each plug in their own analyze data step. Annotation: 80-90% of the work.]
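The reuse argument on this slide can be sketched in a few lines of Python. This is only an illustration, not code from Sympathy for Data: the `Table` alias, the two operations and the `pipeline` helper are hypothetical names chosen for the example. The point is that once every operation takes and returns the same agreed format, any engineer's step can be chained with any other.

```python
from typing import Callable, Dict, List

# Assumed shared format: a table maps column names to equally long lists of values.
Table = Dict[str, List[float]]
Operation = Callable[[Table], Table]


def drop_negative(table: Table, column: str = "value") -> Table:
    """Keep only the rows where `column` is non-negative."""
    keep = [i for i, v in enumerate(table[column]) if v >= 0]
    return {name: [values[i] for i in keep] for name, values in table.items()}


def add_mean(table: Table, column: str = "value") -> Table:
    """Add a column holding the mean of `column`, repeated for every row."""
    values = table[column]
    mean = sum(values) / len(values) if values else 0.0
    return {**table, column + "_mean": [mean] * len(values)}


def pipeline(*operations: Operation) -> Operation:
    """Chain operations; this works only because each maps Table to Table."""
    def run(table: Table) -> Table:
        for operation in operations:
            table = operation(table)
        return table
    return run


clean_and_analyze = pipeline(drop_negative, add_mean)
result = clean_and_analyze({"time": [0.0, 1.0, 2.0], "value": [4.0, -1.0, 2.0]})
```

Here `drop_negative` and `add_mean` could have been written by different people; neither needs to know about the other as long as both respect the table format.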

Page 15: Sympathy for data

The Pareto principle states that 20% of the work solves 80% of the problem; we are attacking the ELT problem

Basic Requirement | Advantage | Challenge
Isolated execution environment. | Guarantee functionality. | Design environment(s).
Data type system for inputs and outputs. | Well-defined data. | Design type system.
Library of reusable operations. | Saving time and improving quality of operations. | Granularity of operations.
Graphical editor to build data flow graphs. | No coding knowledge required for user. | Visualization and user interaction concepts.

Page 16: Sympathy for data

The Result Became ”Sympathy for Data”

Page 17: Sympathy for data

Technology Overview

Page 18: Sympathy for data

The platform is based on Python

• Python 2.7 with NumPy and SciPy as a foundation.

• Easy for Matlab users to convert.

• Plenty of computational and plotting libraries to choose from.

• HDF5 for storage of intermediate data.

• Easy to read subsets of data.

• User Interface: PySide (Qt)

• Started in C++ but switched to Python for faster development rate.

• No feedback loops in flows, just list recursion.

• Type system since tables are not enough.
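A flow without feedback loops is a directed acyclic graph, so its blocks can always be put in an order where every block runs after its inputs. The sketch below illustrates that property with the standard library's `graphlib` module (Python 3.9 or later, not the Python 2.7 the platform used). The flow and block names are made up, and this is not how Sympathy for Data itself schedules blocks.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical flow: each block maps to the set of blocks it reads from.
flow = {
    "import": set(),
    "clean": {"import"},
    "analyze": {"clean"},
    "report": {"analyze", "clean"},
}


def execution_order(dependencies):
    """Return the blocks in an order where every block follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


def has_feedback_loop(dependencies):
    """True if the flow contains a cycle and so has no valid execution order."""
    try:
        execution_order(dependencies)
    except CycleError:
        return True
    return False


order = execution_order(flow)
```

Forbidding feedback loops means `has_feedback_loop` is always false for a valid flow, which keeps scheduling simple: one pass over `order` is enough.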

Page 19: Sympathy for data

We work with text and tables in combination with containers

Data: Text, Table (in the future: image, sound, etc.)

Containers: List, Record (Named Tuple), Dictionary (String Keys)

Page 20: Sympathy for data

Example of types

type1: (desc: text, data: [table], prop: { (f1: text, f2: table) })

type2: (desc: text, content: [type1])

type1 is a record with fields ’desc’, ’data’ and ’prop’.

type1 is referred to in type2.
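For readers more used to Python than to the notation above, the two example types can be mirrored with `typing.NamedTuple`. This assumes the notation reads as ( ) for a record, [ ] for a list and { } for a dictionary with string keys, which matches the container slide but is not spelled out in the deck. `Text`, `Table` and the class names are stand-ins invented for the sketch.

```python
from typing import Dict, List, NamedTuple

# Stand-ins for the platform's data types (assumptions for illustration).
Text = str
Table = Dict[str, List[float]]


class Prop(NamedTuple):
    """The anonymous record (f1: text, f2: table) stored inside `prop`."""
    f1: Text
    f2: Table


class Type1(NamedTuple):
    """type1: (desc: text, data: [table], prop: { (f1: text, f2: table) })"""
    desc: Text
    data: List[Table]
    prop: Dict[str, Prop]


class Type2(NamedTuple):
    """type2: (desc: text, content: [type1])"""
    desc: Text
    content: List[Type1]


item = Type1(
    desc="run 1",
    data=[{"t": [0.0, 1.0]}],
    prop={"meta": Prop(f1="sensor A", f2={"gain": [2.0]})},
)
collection = Type2(desc="all runs", content=[item])
```

The nesting shows why plain tables are not enough: `collection` holds a list of records, each of which holds a list of tables and a dictionary of further records.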

Page 21: Sympathy for data

We are using separate worker processes for each block

[Figure: a scheduler distributing blocks to Worker 1, Worker 2, Worker 3 and Worker 4.]
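The idea of one process per block can be sketched with the standard library alone. Everything here is an assumption made for illustration: the `BLOCKS` table, the two toy blocks and the use of JSON over pipes are stand-ins (the platform stores intermediate data in HDF5), and the toy scheduler runs blocks one after another rather than in parallel.

```python
import json
import subprocess
import sys

# Hypothetical blocks: each is a tiny Python program that reads JSON on
# stdin and writes JSON on stdout, so it can run in its own process.
BLOCKS = {
    "double": "import json, sys; print(json.dumps([2 * x for x in json.load(sys.stdin)]))",
    "total": "import json, sys; print(json.dumps(sum(json.load(sys.stdin))))",
}


def run_block(name, data):
    """Run one block in a fresh interpreter process and return its output."""
    completed = subprocess.run(
        [sys.executable, "-c", BLOCKS[name]],
        input=json.dumps(data),
        capture_output=True,
        text=True,
        check=True,  # a crashing block raises here instead of taking down the scheduler
    )
    return json.loads(completed.stdout)


def run_flow(names, data):
    """Minimal scheduler: run the blocks in sequence, each in its own process."""
    for name in names:
        data = run_block(name, data)
    return data


result = run_flow(["double", "total"], [1, 2, 3])
```

The benefit of the separation is isolation: a block that fails or leaks memory affects only its own process, which fits the earlier requirement of an isolated execution environment.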

Page 22: Sympathy for data

Demonstration

Page 23: Sympathy for data

Future and Conclusion

Page 24: Sympathy for data

To sum up, Sympathy for Data was born since nothing fulfilled our needs

• Existing solutions found on the market only work with well-formed tables.

• Evaluated software requires data to be preprocessed.

• Faster and cheaper to adapt our own platform for our needs.

• Many engineers are not ”multi-instrumentalists”.

• And of course, personal interest and commitment.

Page 25: Sympathy for data

Sympathy for Data is currently powering several customer applications

• Automation of manual ELT workflows with heterogeneous data sources.

• Failure/warranty prediction.

• Replacing existing outdated Matlab scripts.

Page 26: Sympathy for data

And recycling code between applications is working well…

Page 27: Sympathy for data

We still need to work on some important areas

• Mature development environment for blocks.

• Improve support for interactive work.

• Clean up library with ”Any”-type.

• Introduce type for functions.

• Higher-order functions — develop for singular case, scale to plural.

• Improve performance.

• Polish, polish, polish… The software is still quite rough.
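The bullet on higher-order functions ("develop for singular case, scale to plural") describes planned work, so the following is only one possible reading of it, written as a plain Python sketch. `lift_to_list` and `normalize` are hypothetical names: a block author writes the operation for a single table, and a generic wrapper turns it into an operation on a list of tables.

```python
from typing import Callable, Dict, List

Table = Dict[str, List[float]]  # assumed stand-in for the platform's table type


def lift_to_list(
    operation: Callable[[Table], Table],
) -> Callable[[List[Table]], List[Table]]:
    """Turn an operation on one table into an operation on a list of tables."""
    def apply_to_all(tables: List[Table]) -> List[Table]:
        return [operation(table) for table in tables]
    return apply_to_all


def normalize(table: Table) -> Table:
    """Singular case: scale every column so its largest absolute value is 1."""
    scaled = {}
    for name, values in table.items():
        # Fall back to 1.0 for empty or all-zero columns to avoid dividing by zero.
        peak = max((abs(v) for v in values), default=0.0) or 1.0
        scaled[name] = [v / peak for v in values]
    return scaled


normalize_all = lift_to_list(normalize)
results = normalize_all([{"x": [1.0, 2.0]}, {"x": [-4.0, 2.0]}])
```

With a function type in the type system, a wrapper like this would only need to exist once in the library instead of every block shipping both a single and a list variant.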