«Big data analysis and data preparation before...
© ALTOROS Systems | CONFIDENTIAL
Big Data Analysis and Data pre-processing
Parfenovich Sofia, Data Science specialist
2013, Minsk
Agenda
• Typical data analysis tasks
• Data pre-processing problem and k-means
• Pre-processing methods
• Feature selection and why you shouldn’t listen to the client
Typical tasks
Clustering
• Recommendation system
• Groups in social networks
• Image processing
Classification
• Credit risk
• Image processing
• Trade systems
• Biometrics
Regression
• Trade systems
• Business tasks
• Medicine
Typical tasks
Prediction
• Time series prediction
• Trading
• Internal business tasks
Pattern recognition
• Image processing
• Semantics
• Time series analysis
Anomaly detection
• Trading
• Monitoring systems
Clustering and k-means
Recommendation system
Customer
Same purchase history:
{item1, item2, item3}
{item1, item2, ??}
Items
Similar features (for books):
{author, genre, country, year, ...}
Clustering and k-means
How to solve?
• Gather data
• Use k-means for clustering
• Divide data into clusters
• Recommend items from the same cluster
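The last step, recommending items from the same cluster, could be sketched as follows. This is only an illustration: the function `recommend`, the client names, and the cluster labels in `cluster_of` are hypothetical and stand in for whatever the clustering step actually returns.

```python
from collections import Counter

def recommend(client, purchases, cluster_of, top_n=3):
    """Suggest items bought by cluster-mates that `client` does not own yet."""
    own = purchases[client]
    counts = Counter()
    for other, items in purchases.items():
        if other != client and cluster_of[other] == cluster_of[client]:
            counts.update(items - own)
    # Most frequently bought first; ties broken alphabetically for determinism.
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:top_n]

# Hypothetical purchase histories and cluster labels (e.g. produced by k-means).
purchases = {
    "client1": {"item1", "item2", "item3"},
    "client2": {"item1", "item2"},
    "client3": {"item4", "item5"},
}
cluster_of = {"client1": 0, "client2": 0, "client3": 1}

print(recommend("client2", purchases, cluster_of))
```

Here client2 shares a cluster with client1, so the only candidate is the item client1 bought and client2 did not.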
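K-means
Algorithm:
• Select initial centroids
• Calculate the distance between centroids and points
• Make clusters (assign each point to its nearest centroid)
• Re-calculate centroids
• Repeat the previous three steps until the centroids stop changing
• Enjoy the result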
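The steps of the algorithm can be written down as a minimal sketch in plain Python. It assumes random initial centroids drawn from the data, Euclidean distance, and a fixed iteration cap; the function name `kmeans` and its parameters are illustrative, not taken from the slides.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # select initial centroids
    labels = []
    for _ in range(max_iter):
        # Distance from every point to every centroid; assign to the nearest.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Re-calculate each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old centroid
        if new_centroids == centroids:  # nothing moved, so the result is stable
            break
        centroids = new_centroids
    return labels, centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The result depends on the initial centroids, which is why the seed is exposed as a parameter.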
Calculating the distance

Purchase history
Point (one row per client):
clientID  Item1  Item2  …  ItemN-1  ItemN
1         0      1      0  0        1
2         1      1      0  0        1
3         0      0      0  1        1
Euclidean distance:
(Client1, Client2) = 1
(Client2, Client3) = √3
(Client1, Client3) = √2

Item features
Point (one row per item):
ItemID  F1 (author)  F2 (genre)  …  FN (year)
1       34354        12          …  1990
2       23           7           …  1898
3       5676         67          …  2013
Euclidean distance:
(Item1, Item2) = 76786788
(Item2, Item3) = 67
(Item1, Item3) = 6757665567566
??!!
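To see why the second table is a problem, the distances can be recomputed from the visible table values. The sketch below assumes only the columns shown in the tables (the item columns hidden behind "…" are left out), so the item distances it prints are illustrative rather than the figures from the slide.

```python
import math
from itertools import combinations

# Rows copied from the tables above. For items, the columns hidden behind "…"
# are omitted, so only F1 (author), F2 (genre) and FN (year) take part.
clients = {1: (0, 1, 0, 0, 1), 2: (1, 1, 0, 0, 1), 3: (0, 0, 0, 1, 1)}
items = {1: (34354, 12, 1990), 2: (23, 7, 1898), 3: (5676, 67, 2013)}

def pairwise_distances(rows):
    """Euclidean distance for every pair of row ids."""
    return {
        (a, b): math.dist(rows[a], rows[b])
        for a, b in combinations(sorted(rows), 2)
    }

client_d = pairwise_distances(clients)
item_d = pairwise_distances(items)

for pair, d in client_d.items():
    print("clients", pair, round(d, 3))
for pair, d in item_d.items():
    print("items", pair, round(d, 1))
```

The client distances stay on a comparable scale because every coordinate is 0 or 1. The item distances are dominated by the author code, which is an arbitrary identifier, so the numbers say nothing about how similar two books are.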
Clustering and k-means
Recommendation system
• Same purchase history: Success!!!
• Similar features: Fail!!!!
• Similar features with data pre-processing or algorithm modification: Success!!!
Pre-processing
Problem → Solution
Raw data
• Non-numeric data → Encoding
• Numeric data → Normalization (standardization)
• Missing values → Interpolation
Internal problems
• Outliers → Detecting and smoothing
• Noisy data → De-noising
• Uniform distribution → Changing the dimension of the data space
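The slides do not say how outliers are detected or how noise is removed, so the following is only one possible sketch. It assumes a median/MAD-based modified z-score for detection, replacement by the series median for smoothing, and a centered moving average for de-noising; all function names and the threshold are illustrative.

```python
import statistics

def find_outliers(values, threshold=3.5):
    """Indices whose modified z-score (median/MAD based) exceeds `threshold`."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all: nothing can be flagged this way
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]

def smooth_outliers(values, threshold=3.5):
    """Replace every flagged value with the median of the whole series."""
    med = statistics.median(values)
    flagged = set(find_outliers(values, threshold))
    return [med if i in flagged else v for i, v in enumerate(values)]

def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out

series = [1.0, 1.2, 0.9, 1.1, 25.0, 1.0, 0.8]  # hypothetical data with one spike
print(find_outliers(series))
print(smooth_outliers(series))
print(moving_average(smooth_outliers(series)))
```

Smoothing the outlier before averaging matters: otherwise the spike would leak into its neighbours through the moving window.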
Pre-processing methods
Encoding
– {“small”, “medium”, “big”} => {0, 1, 2} => {-1, 0, 1}
– {“paris”, “london”, “milan”} => {{1,0,0}, {0,1,0}, {0,0,1}}
Normalization
– [-100; 300] => [-1; 1] or [0; 1]
Standardization
– {mean(data) = a, std(data) = b} => {mean(data) = 0, std(data) = 1}
Interpolation
– {0.3, 0.5, 0, NaN, -0.2} => {0.3, 0.5, 0, -0.1, -0.2}
Outliers
Noise
Uniform distribution
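The first four methods can be reproduced with a few lines of plain Python. This is a sketch under assumptions: the function names are illustrative, standardization uses the population standard deviation, and interpolation is linear and fills only gaps that have known values on both sides.

```python
import math
import statistics

def ordinal_encode(values, order):
    """Map ordered categories to evenly spaced codes centred on 0."""
    mid = (len(order) - 1) / 2
    return [order.index(v) - mid for v in values]

def one_hot(values, categories):
    """One indicator column per category, for categories with no natural order."""
    return [[1 if v == c else 0 for c in categories] for v in values]

def min_max(values, lo=0.0, hi=1.0):
    """Rescale linearly to [lo, hi]; assumes the values are not all equal."""
    vmin, vmax = min(values), max(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]

def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def interpolate_nan(values):
    """Fill interior NaNs linearly; leading and trailing NaNs are left as-is."""
    out = list(values)
    known = [i for i, v in enumerate(out) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            t = (i - left) / (right - left)
            out[i] = out[left] + t * (out[right] - out[left])
    return out

sizes = ["small", "medium", "big"]
cities = ["paris", "london", "milan"]
print(ordinal_encode(sizes, sizes))
print(one_hot(cities, cities))
print(min_max([-100, 100, 300], -1.0, 1.0))
print(standardize([-100, 100, 300]))
print(interpolate_nan([0.3, 0.5, 0, float("nan"), -0.2]))
```

Ordinal codes suit categories with a natural order such as sizes, while one-hot vectors avoid inventing an order between cities.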
Feature selection
• Customer
• Initial data
• Method restrictions
• Feature selection
• Preliminary data analysis
• Changing the dimension
The way to success
Gain information
• Try to gain as much information as possible
Get some expert knowledge
• Ask what kind of results are expected
• Understand the main principles and nature of the data
Don’t eliminate features
• That is, no a priori elimination without any research
• Don’t listen to the client and experts
Use preliminary data analysis
• Check whether the data match the problem
• Pre-processing?
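A preliminary data analysis can start with something as simple as a per-feature summary. The sketch below is an assumption about what such a first look might contain (missing counts, range, mean, spread); the function `summarize`, the feature names, and the sample rows are hypothetical.

```python
import math
import statistics

def summarize(rows, names):
    """Per-feature missing count plus min, max, mean and std of the known values.

    Assumes every feature has at least one non-missing value.
    """
    report = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        present = [v for v in column if not math.isnan(v)]
        report[name] = {
            "missing": len(column) - len(present),
            "min": min(present),
            "max": max(present),
            "mean": statistics.fmean(present),
            "std": statistics.pstdev(present),
        }
    return report

# Hypothetical feature table: price, number of pages, publication year.
names = ["price", "pages", "year"]
rows = [
    (12.5, 320.0, 1990.0),
    (8.0, 150.0, float("nan")),
    (30.0, 900.0, 2013.0),
]

report = summarize(rows, names)
for name, stats in report.items():
    print(name, stats)
```

Features with very different ranges hint at normalization, and a non-zero missing count hints at interpolation, before any feature is dropped.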
Questions?