End-to-End data mining feature integration, transformation and selection with Datameer

© 2014 Datameer, Inc. All rights reserved.


Source: files.meetup.com/18369817/SBDUG17-Datameer-Feature_Selection.pdf


Fastest time to Insights

Rapid Data Integration
• Zero-coding data integration
• Wizard-led data integration, no ETL required
• 55+ out-of-the-box adapters
• OpenAPI to create custom data connections
• Schema on read
• Flexible integration methods
• Exception reporting

Rapid Feature Transformation
• Point & click analytics
• Spreadsheet UI
• 270+ pre-built functions
• Visual data profiling
• Drag & drop visualization

Powerful Feature Selection
• Out-of-the-box data mining on Hadoop (Decision Trees, Column Dependency, …, Pearson, Spearman, …)
• Reuse of own functions written in Java, R, Python, SAS, SPSS and more

Feature discovery, selection and data mining on Big Data in a fraction of the time

Problem

It takes months to integrate, pre-process, merge and select data from a wide range of data sources for data mining in the area of credit scoring.

This is due to:

Technical Challenges
• Large number of source systems
• Heterogeneous data formats
• Large data volume
• Evolving systems lead to long integration processes

Organizational Challenges
• Many alignment round trips between SMEs and IT to get the right data in the right form
• Intermediate insights lead to changing requirements, which in turn trigger new change requests at IT

Solution
• All data from the different sources is ingested into a Hadoop-based data lake in its original format, following the pattern “store everything, discover later”.
• Datameer enables SMEs and data scientists to merge data from many data sources.
• Comprehensive and easy-to-use data transformation functions help to understand and clean up data quickly.
• Feature selection functions make it possible to spot relationships in data sets and reduce thousands of attributes to a couple of hundred or even fewer, depending on the use case.
• Different sampling techniques are applied to extract data for building predictive models in SAS.
• Datameer’s PMML interface allows those predictive models to be run on Big Data to get more precise rules.
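The slides do not say which sampling techniques are used to extract modeling data. As one common possibility, the sketch below shows stratified sampling in plain Python: every class of a label column is sampled at the same rate, so rare outcomes such as defaults stay represented. The function name, the `default` column and the records are hypothetical and are not part of Datameer or SAS.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_key, fraction, seed=0):
    """Draw the same fraction of rows from every class of `label_key`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    sample = []
    for label in sorted(by_class):
        group = by_class[label]
        # Keep at least one row per class so rare outcomes are not lost.
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical credit-scoring records: 90 good loans and 10 defaults.
rows = [{"id": i, "default": i % 10 == 0} for i in range(100)]
sample = stratified_sample(rows, "default", fraction=0.2)
```

With a 20% fraction the sample holds 18 good loans and 2 defaults, so the class ratio of the full data set is preserved in the extract.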

Results
• Datameer reduces the process of data integration, feature transformation and selection from months to merely days.
• Datameer eliminated the overhead processes between IT and business units.
• SMEs and data scientists can use Datameer as a self-service platform for data discovery without going back and forth with IT over ever-changing requirements.
• Predictive data mining now delivers better results because models can be run on Big Data.

Feature Selection & Discovery

Feature Transformation
• Data cleansing
• Vari. histogram distributions
• Reduce cardinality
• Binning
• …

Feature Selection
• Pearson
• Spearman
• Mutual Information
• Gini
• …

Modeling
• Regression
• Neural Network
• Bayesian Networks
• …

Prediction, Scoring, …
• PMML
• Ensemble

3rd party tools for modeling sampled data and Datameer for executing models on Big Data
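To make the feature selection stage concrete, here is a minimal sketch of correlation-based filtering in plain Python. It computes Pearson and Spearman coefficients by hand and keeps the columns whose absolute Spearman correlation with a target reaches a threshold. This is not Datameer's implementation; the function names, column names, threshold and data are all hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))

def select_features(columns, target, threshold):
    """Keep columns whose |Spearman| with the target is >= threshold."""
    scores = {name: abs(spearman(vals, target)) for name, vals in columns.items()}
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    kept = [name for name, score in ordered if score >= threshold]
    return kept, scores

# Hypothetical attributes and a hypothetical risk score as target.
columns = {
    "income": [20, 35, 50, 65, 80, 95],
    "late_payments": [5, 4, 4, 2, 1, 0],
    "customer_id": [3, 1, 6, 2, 5, 4],
}
target = [0.9, 0.7, 0.6, 0.4, 0.2, 0.1]
kept, scores = select_features(columns, target, threshold=0.8)
```

In this toy data `income` and `late_payments` pass the threshold, while the arbitrary `customer_id` column is dropped. The same filter-and-rank pattern could use mutual information or Gini as the score instead.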

Solution Architecture Blueprint

[Diagram. Labels: Data Sources, DB, Hive, Import Adapters or Data Links, Workbooks (Filtering, Aggregation, Joins), Visualization, Workbooks]

• Option: Write results to database and import to mining tools to build models
• Option: Export CSV or write to Hive Table for Data Mining Tools
• Export PMML to Datameer
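The CSV export option in the blueprint amounts to handing only the selected attributes to an external mining tool. The sketch below illustrates that hand-off with Python's standard `csv` module. It does not use any Datameer API; the function name, column names and records are hypothetical.

```python
import csv
import io

def export_selected(rows, selected, out):
    """Write only the selected columns of each record as CSV to `out`."""
    # extrasaction="ignore" drops every key that is not in `selected`.
    writer = csv.DictWriter(out, fieldnames=selected, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical records after feature selection has chosen two columns.
rows = [
    {"income": 20, "late_payments": 5, "customer_id": 3},
    {"income": 35, "late_payments": 4, "customer_id": 1},
]
buffer = io.StringIO()
export_selected(rows, ["income", "late_payments"], buffer)
csv_text = buffer.getvalue()
```

Writing to an in-memory buffer keeps the example self-contained; a real export would pass an open file handle instead.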