STRATIO BIG DATA SCIENCE PLATFORM - Meetupfiles.meetup.com/17637122/Stratio Big Data Science...
© Stratio 2015. Confidential, All Rights Reserved.
Library | Single-Machine | Distributed | Distributed Graph Algorithms | Visualization | IDE | Spark Integration | Hadoop Integration | Community
Spark: MLlib+GraphX | ++ | ++++ | ++++ | - | ++ | ++++ | +++ | ++++
R | ++++ | + | - | ++++ | ++++ | ++ | + | ++++
Scikit-learn | ++++ | ++ | - | ++++ | +++ | ++ | + | +++
H2O | + | +++ | - | +++ | ++ | +++ | +++ | ++
Apache Mahout | ++ | +++ | - | - | + | ++ | +++ | +
Apache SystemML | ++ | +++ | - | - | ++ | +++ | +++ | ++
No library provides all of these features at a good or very good level.
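The conclusion above can be checked mechanically. The following is a minimal sketch, assuming that "+++" stands for "good" and "++++" for "very good" (the slide does not define the scale); the feature keys are illustrative names for the table columns.

```python
# Ratings transcribed from the comparison table; '-' means the feature is absent.
FEATURES = [
    "single_machine", "distributed", "distributed_graph", "visualization",
    "ide", "spark_integration", "hadoop_integration", "community",
]

RATINGS = {
    "Spark MLlib+GraphX": "++ ++++ ++++ - ++ ++++ +++ ++++",
    "R":                  "++++ + - ++++ ++++ ++ + ++++",
    "Scikit-learn":       "++++ ++ - ++++ +++ ++ + +++",
    "H2O":                "+ +++ - +++ ++ +++ +++ ++",
    "Apache Mahout":      "++ +++ - - + ++ +++ +",
    "Apache SystemML":    "++ +++ - - ++ +++ +++ ++",
}

GOOD = 3  # assumption: '+++' is "good" and '++++' is "very good"


def score(symbol):
    """Map '-' to 0 and a run of '+' signs to its length."""
    return 0 if symbol == "-" else len(symbol)


def weak_features(row):
    """Return the features a library rates below the GOOD threshold."""
    scores = [score(symbol) for symbol in row.split()]
    return [name for name, value in zip(FEATURES, scores) if value < GOOD]


weak = {library: weak_features(row) for library, row in RATINGS.items()}
fully_good = [library for library, gaps in weak.items() if not gaps]
```

Under that assumption `fully_good` is empty: every library has at least one feature rated below "good".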
Spark: MLlib + GraphX
Algorithm name   Single Machine   Distributed
Classification and Regression
Linear Support Vector Machines (SVMs) X X
Logistic regression X X
Linear least squares X X
Lasso L1 regularization X X
Ridge regression X X
Streaming linear regression X X
Isotonic regression X X
Decision Trees X X
Ensembles of decision trees (Random forests, Gradient-boosted trees) X X
Naive Bayes (Multinomial naive Bayes, Bernoulli naive Bayes) X X
Collaborative Filtering
Alternating least squares (ALS) X X
Clustering
K-means X X
Gaussian mixture X X
Power iteration clustering (PIC) (using GraphX as its backend) X X
Latent Dirichlet allocation (LDA) X X
Streaming k-means X X
Dimensionality Reduction
Singular value decomposition X X
Principal component analysis X X
Graph
PageRank X X
Closeness Centrality X X
Betweenness Centrality X X
Triangle Counting X X
Connected Components X X
Strongly Connected Components X X
Frequent Pattern Mining
FP-growth X X
Association rules X X
PrefixSpan X X
Feature Extraction and Transformation
Term frequency-inverse document frequency (TF-IDF) X X
Word2Vec X X
StandardScaler X X
Normalizer X X
Feature selection (ChiSqSelector) X X
ElementwiseProduct X X
R
Algorithm name   Single Machine   Distributed
Classification and Regression
Linear Support Vector Machines X
Penalized SVM X
Outliers X
Decision Trees X
Ridge regression X
Naïve Bayes X
Adaboost X
JRip X
…... X
Collaborative Filtering
Alternating least squares (ALS) X
Clustering
K-means X
Hybrid Hierarchical Clustering X
Expectation Maximization (EM) X
Dissimilarity Matrix Calculation X
Hierarchical Clustering X
Bayesian Hierarchical Clustering X
Density-Based Clustering X
K-Cores X
...
Dimensionality Reduction
Singular value decomposition X
Principal component analysis X
Feature Selection X
... X
Frequent Pattern Mining
FP-growth X
arulesNBMiner X
The Apriori Algorithm X
The Eclat Algorithm X
... X
Feature Extraction and Transformation
Scikit-learn
Algorithm name   Single Machine   Distributed
Classification and Regression
Stochastic Gradient Descent X X
Approximate nearest neighbor X X
Locality Sensitive Hashing Forest (LSH) X X
SVM X X
Gaussian Naive Bayes X X
Multinomial Naive Bayes X X
Bernoulli Naive Bayes X X
Logistic Regression X X
Ridge Regression X
Lasso X
Elastic Net X
Multi-task Lasso X
Least Angle Regression X
LARS Lasso X
Orthogonal Matching Pursuit (OMP) X
Bayesian Regression X
…. X
Clustering
K-means X X
Affinity propagation X
Mean-shift X
Spectral clustering X
Ward hierarchical clustering X
Agglomerative clustering X
DBSCAN X
Gaussian mixtures X
Birch X
Dimensionality Reduction
Incremental PCA X
Kernel PCA X
Truncated singular value decomposition and latent semantic analysis X
Sparse coding with a precomputed dictionary X
Generic dictionary learning X
Factor Analysis X
Independent component analysis X
Non-negative matrix factorization (NMF or NNMF) X
Latent Dirichlet Allocation (LDA) X
... X
Frequent Pattern Mining
FP-growth X
arulesNBMiner X
The Apriori Algorithm X
The Eclat Algorithm X
... X
Feature Extraction and Transformation
Standardization, or mean removal and variance scaling X
Normalization X
Binarization X
Encoding categorical features X
Imputation of missing values X
Generating polynomial features X
Custom transformers X
Grid Search X
Cross Validation
K-Fold X X
Leave-One-Out - LOO X
Random permutations cross-validation a.k.a. Shuffle & Split X
... X
H2O
Algorithm name   Single Machine   Distributed
Classification and Regression
Generalized Linear Models X X
Distributed Random Forest X X
Naive Bayes X X
Gradient Boosted Regression X X
Gradient Boosted Classification X X
Clustering
K-means X X
Dimensionality Reduction
Principal component analysis X X
Feature Extraction and Transformation X X
Grid Search X X
Deep Learning X X
Apache Mahout
Algorithm name   Single Machine   Distributed
Classification and Regression
Random Forest X X*
Naïve Bayes X X*
Hidden Markov Models X
Multilayer Perceptron X
Logistic Regression X
Collaborative Filtering
Alternating least squares (ALS) X X*
Clustering
K-means X X*
Fuzzy K-Means X X
Streaming K-Means X X*
Spectral Clustering X
Dimensionality Reduction
Singular value decomposition X X*
Principal component analysis X X*
Lanczos Algorithm X X
QR Decomposition X X
Feature Extraction and Transformation X X
Apache SystemML
Algorithm name   Single Machine   Distributed
Classification and Regression
Multinomial Logistic Regression X X
Binary-Class Support Vector Machines X X
Multi-Class Support Vector Machines X X
Naive Bayes X X
Decision Trees X X
Random Forests X X
Linear Regression X X
Stepwise Linear Regression X X
Generalized Linear Models X X
Stepwise Generalized Linear Regression X X
Regression Scoring and Prediction X X
Clustering
K-means X X
Dimensionality Reduction
Matrix Completion via Alternating Minimizations X X
Principal component analysis X X
Feature Extraction and Transformation X X
DATA SCIENCE BIG DATA PLATFORM
• Integration of different open source libraries with distributed machine learning algorithms
• Development environment for every data scientist
• Making real-time decisions based on the models learned by machine learning algorithms
• Integrated with all components of the Stratio Big Data Platform
• Full management of the knowledge life cycle
Roman Martin
DATA SCIENCE BIG DATA PLATFORM
Milestones
• Machine learning life cycle with Big Data: +30 Big Data components (ingestion, data stores, real-time, visualization, notebook)
• Catalog of distributed machine learning algorithms: 74 distributed machine learning algorithms
• Low learning curve for a data scientist: integrated with 4 Data Science development environments (IPython, Spark, Java)
MACHINE LEARNING (LEARN FROM THE PAST TO PREDICT THE FUTURE)
CATALOG OF DISTRIBUTED MACHINE LEARNING ALGORITHMS
[Diagram: the Data Science development environments (RStudio, IPython, Java & Scala) connect through StratioML (R, Python, Java & Scala) to the ML Distributed Algorithms Catalog (Classification & Regression, Recommendation, Graph, ...)]
Catalog of Distributed Machine Learning Algorithms
Types of Algorithms Number of Algorithms
Classification and Regression 37
Collaborative Filtering 2
Clustering 10
Dimensionality Reduction 7
Graph 6
Frequent Pattern Mining 3
Feature Extraction and Transformation 7
Cross Validation 1
Deep Learning 1
TOTAL 74
The catalog of distributed machine learning algorithms is based on the main open source machine learning libraries available:
● Apache Spark MLlib
● Apache Spark GraphX
● Sparkit-learn
● H2O
● SystemML
● Mahout
StratioML
• Integration with the datastores of the platform:
☑ HDFS, Parquet, HIVE
☑ Cassandra, MongoDB, ElasticSearch
☑ Stratio Postgres Big Data
• Integration of different distributed machine learning algorithm libraries with the various programming languages used by data scientists:
☑ Python
☑ Scala
☑ Java
☑ R
Contains the components that give the data scientist the ability to use the Stratio Big Data Platform and the catalog of distributed machine learning algorithms.
CATALOG OF DISTRIBUTED MACHINE LEARNING ALGORITHMS
STRATIO CUSTOMER INTELLIGENCE
Stratio Customer Intelligence
The main difficulty in a segmentation, profiling, or recommendation problem is defining the criteria, or data, on which the profiling will be based. In the Stratio Big Data Science Platform we always work from these criteria:
➔ Profiling should always be performed in a particular context: user behavior, user reviews, connections, etc.
➔ The result of behavior-based profiling is a very good source of knowledge from which to make recommendations associated with that behavior.
“Knowledge is neither created nor destroyed, it is transformed”
Stratio Customer Intelligence
In Stratio Customer Intelligence, we have developed a set of algorithms that learn from customer behavior and make recommendations based on that behavior.
1. Clustering users based on their behavior
Clusters are generated from each user's intrinsic behavioral information. Users are clustered on these types of behavior:
➔ Quantitative data expressing opinion (ratings, number of views, clicks, etc.)
➔ Sequential behavior (navigation, actions, etc.)
➔ Relationships with other elements (network of friends, networking, consumed products)
2. User classification (Stratio Profiler)
The next step is to automatically link each user to their assigned cluster using the user's own features / labels.
3. Recommendation system (Stratio Recommender)
With the information generated, recommendations are produced by ranking predictions based on the tastes of each cluster and adapting them to the target user.
Stratio Customer Intelligence: User Clustering
The following is an example of clustering based on user ratings on movies (the behavior of the user)
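As a rough illustration of clustering users by their movie ratings, here is a minimal k-means sketch in plain Python. It is not Stratio's implementation: the users, the three-movie rating vectors, and the choice of initial centroids are all hypothetical, and every user is assumed to have rated every movie.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two rating vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def kmeans(points, initial_centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    centroids = [list(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = mean_vector(members)
    return assignment, centroids


# Hypothetical data: each user's ratings (1 to 5) for the same three movies.
ratings = {
    "ana": [5, 4, 1],
    "ben": [4, 5, 2],
    "eva": [1, 2, 5],
    "leo": [2, 1, 4],
}
users = list(ratings)
points = [ratings[u] for u in users]

# Assumption: seed the two clusters with the vectors of "ana" and "eva".
assignment, centroids = kmeans(points, [ratings["ana"], ratings["eva"]])
clusters = dict(zip(users, assignment))
```

With this toy data, "ana" and "ben" end up in one cluster and "eva" and "leo" in the other, which matches their opposite tastes.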
Stratio Customer Intelligence: User Profiling
At this point every user has been assigned a cluster. The classifier seeks to relate the users' default label characteristics, along with any others the data scientist considers relevant, to each user's cluster. The system takes these inputs:
➔ Categorical characteristics of users
➔ Relevant added features
➔ Label assignments given by the target cluster
The classifier generates a new model able to assign users to clusters.
Stratio Customer Intelligence: User Recommendation
Recommendation systems are much more powerful when built on sets of users with similar characteristics. Grouping users into clusters yields group-level recommendations. These recommendations are then delivered to a specific user after being cross-checked against that user's history.
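The idea can be sketched as follows: rank movies by their average rating among the other members of the target user's cluster, then drop anything already in that user's history. This is an illustrative simplification, not the Stratio Recommender itself; the function name, users, movies, and cluster assignments are hypothetical.

```python
from collections import defaultdict


def recommend_for_user(user, ratings, clusters, top_n=3):
    """Rank movies the user has not rated by their mean rating within the user's cluster."""
    cluster_id = clusters[user]
    totals = defaultdict(float)
    counts = defaultdict(int)
    for other, other_cluster in clusters.items():
        if other == user or other_cluster != cluster_id:
            continue
        for movie, rating in ratings[other].items():
            totals[movie] += rating
            counts[movie] += 1
    seen = set(ratings[user])  # the user's history
    scored = [(totals[m] / counts[m], m) for m in totals if m not in seen]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by title
    return [movie for _, movie in scored[:top_n]]


# Hypothetical ratings and cluster assignments.
ratings = {
    "ana": {"Alien": 5, "Heat": 4},
    "ben": {"Alien": 4, "Heat": 5, "Ronin": 5, "Big": 2},
    "cat": {"Alien": 5, "Ronin": 4, "Big": 3},
    "eva": {"Big": 5, "Up": 5},
}
clusters = {"ana": 0, "ben": 0, "cat": 0, "eva": 1}

recommendations = recommend_for_user("ana", ratings, clusters)
```

Here "eva" belongs to a different cluster, so her high ratings for "Big" and "Up" do not influence what "ana" is shown.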
Stratio Customer Intelligence: Summary
Below is a comparison of the recommendation success ratio of the Stratio Customer Intelligence algorithms versus the Spark ALS recommendation algorithm on the MovieLens dataset.
MovieLens dataset:
Number of ratings: 1,000,209
Number of users: 6,040
Number of movies: 3,706
Number of evaluated recommendations: 114,518
Recommendation success ratio:
Algorithm | Success ratio
Spark ALS | 33.63 %
Stratio Recommender | 41.25 %
Stratio Recommender + Stratio Profiling | 57.47 %
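To put the reported figures in perspective, the short sketch below computes the relative gain of each Stratio variant over the Spark ALS baseline. Only the three percentages come from the slide; the helper function is illustrative, and the slide does not say how the success ratio itself was measured.

```python
# Success ratios (percent) as reported in the summary.
SUCCESS_RATIO = {
    "Spark ALS": 33.63,
    "Stratio Recommender": 41.25,
    "Stratio Recommender + Stratio Profiling": 57.47,
}


def relative_improvement(baseline, candidate):
    """Relative gain of candidate over baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


baseline = SUCCESS_RATIO["Spark ALS"]
gains = {
    name: relative_improvement(baseline, value)
    for name, value in SUCCESS_RATIO.items()
    if name != "Spark ALS"
}
```

The gains work out to roughly 23 % for Stratio Recommender alone and roughly 71 % when Stratio Profiling is added.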
IPython - Development Environment on Python
IPython provides:
● A powerful interactive shell
● Development in Jupyter Notebook
● Integration with interactive visualization tools and GUI toolkits:
○ wxPython
○ PyQt4/PySide
○ PyGTK
○ Tk
● Flexible interpreters that can be embedded in other programs
● Easy-to-use, optimized tools for distributed processing
RStudio - Development Environment on R
RStudio is an integrated development environment (IDE) for R that includes:
● Console
● Syntax-highlighting editor with direct code execution
● Logs
● History
● Debugging
● Workspace management
Java & Scala
There are several development environments for Java and Scala. Two of the most widely used are IntelliJ and Eclipse. These environments provide:
● Syntax-highlighting editor with direct code execution
● History
● Debugging
● Workspace management
● Integration with different SCMs
● Remote debugging
● Integration with Maven and SBT
● Plugin development
MACHINE LEARNING FOR DECISIONS IN REAL TIME
MACHINE LEARNING LIFE CYCLE WITH BIG DATA
Machine Learning Life Cycle
The aim is to manage the knowledge life cycle. This requires a recursive, automated, real-time knowledge management process that uses machine learning technology to learn from experience and anticipate problems.
BIG DATA | MACHINE LEARNING | DECISION MAKING | ALERT GENERATION
Knowledge Life Cycle
BIG DATA AND DISTRIBUTED PROCESSING OF THE KNOWLEDGE LIFE-CYCLE IN REAL TIME
[Diagram: stages of the knowledge life cycle]
• Ingestion
• Real Time Monitoring
• Store Raw Data
• Data Normalization in Real Time
• Data Enrichment in Real Time
• Data Correlation in Real Time
• Decision Making
• Big Data Multi-persistence
• Machine Learning
• Data Query
• Information Visualization
Ingestion
• Standardized data reception
• High performance and operational flexibility
• Decoupled producers and consumers
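Decoupling producers from consumers means that neither side calls the other directly; both only talk to an intermediate buffer. The sketch below illustrates the pattern with Python's standard `queue` and `threading` modules. The in-process queue is only a stand-in for whatever message broker the platform uses at ingestion, and the event strings are made up.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream; real events must not be None


def producer(buffer, events):
    """Publish events to the buffer without knowing who will read them."""
    for event in events:
        buffer.put(event)  # blocks while the bounded buffer is full
    buffer.put(SENTINEL)


def consumer(buffer, sink):
    """Read events from the buffer without knowing who wrote them."""
    while True:
        event = buffer.get()
        if event is SENTINEL:
            break
        sink.append(event)


events = ["click:home", "click:cart", "purchase:42", "click:logout"]
buffer = queue.Queue(maxsize=2)  # small bound so the producer must wait for the consumer
received = []

threads = [
    threading.Thread(target=producer, args=(buffer, events)),
    threading.Thread(target=consumer, args=(buffer, received)),
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Because the two sides share only the buffer, either one can be replaced or scaled without changing the other.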
Real Time Monitoring
• Monitoring throughout the knowledge life cycle
• Monitoring dashboards and alert management
• Viewing correlated information
Store Raw Data
• Centralized management of the knowledge life cycle
• Learning without going back to the source
• Historical storage
Enrichment, Correlation and Decision making
• Rule-based data enrichment
• Complex event processing (Siddhi CEP)
• Real-time anomaly detection, fraud detection, incident analysis, etc. based on learned models
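Complex event processing typically means matching patterns over a moving window of events. The sketch below shows one such pattern in plain Python: raise an alert when a user accumulates several failed logins within a time window. It does not use Siddhi's API or query language; the event format, threshold, and window length are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


def detect_bursts(events, threshold=3, window_seconds=60):
    """Alert when a user has `threshold` failed logins within `window_seconds`.

    `events` is an iterable of (timestamp, user, kind) tuples ordered by timestamp.
    """
    recent = defaultdict(deque)  # user -> timestamps of recent failures
    alerts = []
    for timestamp, user, kind in events:
        if kind != "login_failed":
            continue
        times = recent[user]
        times.append(timestamp)
        # Drop failures that have fallen out of the sliding window.
        while times and timestamp - times[0] > window_seconds:
            times.popleft()
        if len(times) >= threshold:
            alerts.append((timestamp, user))
    return alerts


# Hypothetical event stream (timestamps in seconds).
events = [
    (0, "ana", "login_failed"),
    (10, "ben", "login_failed"),
    (20, "ana", "login_failed"),
    (30, "ana", "login_ok"),
    (50, "ana", "login_failed"),
    (200, "ben", "login_failed"),
    (210, "ben", "login_failed"),
]

alerts = detect_bursts(events)
```

Only "ana" triggers an alert: her three failures fall within 60 seconds, while "ben" never has more than two inside the same window.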
Big Data Multi-persistence
• Multi-persistence system
• Flexible querying
• Scalable, distributed storage of information
• Updating of aggregated data
Machine Learning
• Machine learning algorithms
• Development environments for machine learning
• Full-cycle analysis of the data
• Visualization of algorithm results
Data Query
• Abstraction of the data source
• Optimization of the most common queries
• Single access interface
Information Visualization
• Reporting and dynamic dashboards
• Integration with BI tools
• Easy, free exploration of the data
DICTIONARY OF DATA
GIVE MEANING TO DATA = KNOWLEDGE
Dictionary of Data
☑ Centralized management of information schemas is a key element of proper data governance. It must address three key issues:
• Unified schema registration
• Schema evolution and versioning over time
• A field dictionary for unified mapping
☑ Without a technological component of this type, there is a significant risk of losing control over the serialization / deserialization of stored data, especially given the time factor and the volatility of unstructured data (changing patterns).
☑ It is critical to give consumers the means to manage data structures and their changes over time.
☑ The solution relies on the strengths of Kafka and Avro to solve the problems inherent in schema registration, such as:
• Assigning a globally unique ID to each registered schema
• Reliable, replicated schema storage
• High-performance distributed architecture
It is necessary to incorporate a semantic layer that serves as a central repository of meta-information.
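The registration and versioning requirements can be illustrated with a small in-memory registry. This is only a sketch of the concept: the class and method names are hypothetical, IDs are unique only within this one process, and nothing here is replicated or distributed as a Kafka and Avro based solution would be. The schemas are written as Avro-style dictionaries purely for familiarity.

```python
import itertools
import json


class SchemaRegistry:
    """Toy registry: one unique ID per distinct schema, one version list per subject."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._schemas = {}     # schema ID -> schema
        self._ids = {}         # canonical JSON text -> schema ID
        self._versions = {}    # subject -> ordered list of schema IDs

    def register(self, subject, schema):
        """Register a schema under a subject and return its ID (idempotent)."""
        fingerprint = json.dumps(schema, sort_keys=True)
        schema_id = self._ids.get(fingerprint)
        if schema_id is None:
            schema_id = next(self._next_id)
            self._ids[fingerprint] = schema_id
            self._schemas[schema_id] = schema
        versions = self._versions.setdefault(subject, [])
        if schema_id not in versions:
            versions.append(schema_id)
        return schema_id

    def get(self, schema_id):
        """Look up the schema needed to deserialize data tagged with this ID."""
        return self._schemas[schema_id]

    def versions(self, subject):
        """Return the schema IDs registered for a subject, oldest first."""
        return list(self._versions[subject])


# Hypothetical schema evolution: version 2 adds an optional field.
click_v1 = {"type": "record", "name": "Click",
            "fields": [{"name": "user", "type": "string"}]}
click_v2 = {"type": "record", "name": "Click",
            "fields": [{"name": "user", "type": "string"},
                       {"name": "page", "type": ["null", "string"], "default": None}]}

registry = SchemaRegistry()
id_v1 = registry.register("clicks", click_v1)
id_v1_again = registry.register("clicks", click_v1)
id_v2 = registry.register("clicks", click_v2)
```

Storing the schema ID alongside each serialized record is what lets a consumer fetch the right schema later, even after the subject has evolved.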
Decision Making and Real Time Alert Generation
For true knowledge management, decision making and alert generation must be automated.
STRATIO DECISION
• Managing the flow of knowledge:
☑ Real-time integration of information originating at different times and from different sources and components.
☑ Detecting patterns of information in real time.
• Rule management:
☑ User-configurable rules that automate real-time decision making and alert generation.
• Integration with machine learning algorithms:
☑ Rules that make decisions and generate alerts based on predictive machine learning algorithms.
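One way to picture rules that combine user configuration with a predictive model is sketched below. It is not Stratio Decision's API: the rule format, the action names, and the toy scoring function (a stand-in for a learned model) are all hypothetical.

```python
def toy_fraud_model(event):
    """Stand-in for a learned model: returns a risk score between 0 and 1."""
    score = 0.0
    if event["country"] != event["home_country"]:
        score += 0.5
    if event["amount"] > 2000:
        score += 0.4
    return score


# User-configurable rules: each condition sees the raw event and the model score.
RULES = [
    {"name": "block_high_risk",
     "condition": lambda event, score: score >= 0.8,
     "action": "BLOCK"},
    {"name": "review_large_amount",
     "condition": lambda event, score: event["amount"] > 1000,
     "action": "ALERT"},
]


def decide(event, model, rules):
    """Score the event once, then return (rule name, action) for every rule that fires."""
    score = model(event)
    return [(rule["name"], rule["action"])
            for rule in rules if rule["condition"](event, score)]


# Hypothetical events.
risky = {"amount": 2500, "country": "BR", "home_country": "ES"}
large = {"amount": 1500, "country": "ES", "home_country": "ES"}
normal = {"amount": 50, "country": "ES", "home_country": "ES"}

risky_actions = decide(risky, toy_fraud_model, RULES)
large_actions = decide(large, toy_fraud_model, RULES)
normal_actions = decide(normal, toy_fraud_model, RULES)
```

Keeping the rules as data separate from the model means either can be changed independently: a data scientist can retrain the model while an operator adjusts thresholds.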
CENTRAL INFORMATION VISUALIZATION
NOT TO STAY ON THE SURFACE
CENTRAL INFORMATION VISUALIZATION
The platform lets you visualize the entire knowledge life cycle and provides different views of the information depending on the consumer.
STRATIO VIEWER
• Apply knowledge:
☑ Aggregates generated in real time
☑ Analytical reports
☑ Joins of information available in different datastores
☑ Real-time heat maps indicating the origin of the information
STRATIO EXPLORER
• Provides all views of knowledge:
☑ Raw information
☑ Information standardized to the data dictionary
☑ Information enriched with inference processes
☑ Correlated information (knowledge)
• Allows interaction with all sections of the platform.
• Allows users to exchange different views of information.
A UNIQUE BIG DATA SCIENCE PLATFORM COVERING THE ENTIRE INFORMATION LIFE CYCLE