«Big data analysis and data preparation before...

© ALTOROS Systems | CONFIDENTIAL

Big Data Analysis and Data pre-processing

Parfenovich Sofia, Data Science specialist

2013, Minsk


Agenda

Typical data analysis task

Data pre-processing problem and k-means

Pre-processing methods

Feature selection and why you shouldn’t listen to the client


Typical tasks

Clustering

• Recommendation system
• Groups in social networks
• Image processing

Classification

• Credit risk
• Image processing
• Trade systems
• Biometrics

Regression

• Trade systems
• Business tasks
• Medicine


Typical tasks

Prediction

• Time series prediction
• Trading
• Business internal tasks

Pattern recognition

• Image processing
• Semantics
• Time series analysis

Anomaly detection

• Trading
• Monitoring systems


Clustering and k-means

Recommendation system

Customer

Same purchase history:
{item1, item2, item3}
{item1, item2, ??}

Items

Similar features (for books):
{author, genre, country, year, …}


How to solve?

• Gather data
• Use k-means for clustering
• Divide data into clusters
• Recommend items from the same cluster
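The last step, recommending items from the same cluster, could be sketched as follows. The purchase data, the cluster labels, and the function name `recommend` are hypothetical; the labels are assumed to come from a clustering step such as k-means.

```python
from collections import Counter

# Hypothetical purchase histories: client ID -> set of purchased items.
purchases = {
    1: {"item1", "item2", "item3"},
    2: {"item1", "item2"},
    3: {"item4", "item5"},
}

# Hypothetical cluster labels, assumed to come from a clustering step.
cluster_of = {1: 0, 2: 0, 3: 1}


def recommend(client, purchases, cluster_of):
    """Return items bought by cluster peers that the client lacks,
    most frequently bought first (ties broken alphabetically)."""
    own = purchases[client]
    counts = Counter()
    for other, items in purchases.items():
        if other != client and cluster_of[other] == cluster_of[client]:
            counts.update(items - own)
    return sorted(counts, key=lambda item: (-counts[item], item))


print(recommend(2, purchases, cluster_of))  # ['item3']
```

Client 2 shares a cluster with client 1, so the one item client 1 bought that client 2 lacks is suggested.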



K-means

Algorithm:

• Select initial centroids
• Calculate the distance between centroids and points
• Make clusters
• Re-calculate centroids
• Enjoy the result
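The algorithm steps listed on this slide can be sketched in plain Python. This is a minimal illustration, not a production implementation: taking the first `k` points as initial centroids and repeating until the assignments stop changing are assumptions made here, and the sample points are invented.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    # Step 1: select initial centroids (here simply the first k points).
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: measure distances and assign each point
        # to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Step 4: re-calculate each centroid as the mean of its cluster.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(col) / len(members) for col in zip(*members)
                )
    return labels, centroids


# Invented sample data: two well-separated groups.
points = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Because every step relies on Euclidean distance, the result depends directly on the scale and meaning of each coordinate.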


Calculating the distance

Purchase history

Point: one row of the purchase table (one client)

clientID  Item1  Item2  …  ItemN-1  ItemN
1         0      1      0  0        1
2         1      1      0  0        1
3         0      0      0  1        1

Euclidean distance: d(x, y) = sqrt((x1 - y1)^2 + … + (xN - yN)^2)

(Client1, Client2) = 1 (over the columns shown)

Item features

Point: one row of the feature table (one item)

ItemID  F1 (author)  F2 (genre)  …  FN (year)
1       34354        12          …  1990
2       23           7           …  1898
3       5676         67          …  2013

Euclidean distance:

(Item1, Item2) = 76786788

(Item2, Item3) = 67

(Item1, Item3) = 6757665567566

??!!
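To make the contrast concrete, here is a small sketch that computes Euclidean distances for both tables. It assumes the client rows are exactly the five columns shown and keeps only the three visible item columns (author ID, genre ID, year). The resulting numbers are therefore illustrative and are not meant to reproduce the figures on the slide.

```python
import math

# Purchase vectors from the slide (columns Item1 .. ItemN as shown).
clients = {
    1: (0, 1, 0, 0, 1),
    2: (1, 1, 0, 0, 1),
    3: (0, 0, 0, 1, 1),
}

# Item features from the slide, keeping only the visible columns:
# (author ID, genre ID, year). Dropping the elided columns is an assumption.
items = {
    1: (34354, 12, 1990),
    2: (23, 7, 1898),
    3: (5676, 67, 2013),
}


def pairwise(rows):
    """Euclidean distance for every unordered pair of rows."""
    ids = sorted(rows)
    return {
        (a, b): math.dist(rows[a], rows[b])
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
    }


client_d = pairwise(clients)
item_d = pairwise(items)
print(client_d[(1, 2)])       # 1.0
print(round(item_d[(1, 2)]))  # 34331
```

The client distances stay small and comparable because every coordinate is 0 or 1. The item distances are dominated by the author ID, an arbitrary label whose numeric difference carries no meaning.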


Recommendation system

Same purchase history => Success!!!

Similar features => Fail!!!!

Data pre-processing or algorithm modification => Success!!!
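One possible "algorithm modification" is to replace the Euclidean distance with a measure that treats ID-like features as categories. The sketch below is an assumption made for illustration, not the method used in the talk. The hypothetical `mixed_distance` is similar in spirit to Gower's distance and again uses only the three visible item columns.

```python
# Item features from the earlier slide, visible columns only:
# (author ID, genre ID, year).
items = {
    1: (34354, 12, 1990),
    2: (23, 7, 1898),
    3: (5676, 67, 2013),
}

years = [row[2] for row in items.values()]
year_range = max(years) - min(years)


def mixed_distance(a, b):
    """Average per-feature dissimilarity, each in [0, 1].

    Author and genre IDs are categorical: 0 if equal, 1 otherwise.
    Year is numeric: absolute difference divided by the observed range.
    """
    author = 0.0 if a[0] == b[0] else 1.0
    genre = 0.0 if a[1] == b[1] else 1.0
    year = abs(a[2] - b[2]) / year_range
    return (author + genre + year) / 3


print(round(mixed_distance(items[1], items[3]), 3))  # 0.733
```

Every feature now contributes at most 1 to the total, so no single column can dominate. Plain k-means averages coordinates, so a distance like this would typically be paired with a medoid-based variant. That choice is outside the slides.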



Pre-processing

Problem => Solution

Raw data

• Non-numeric data => Encoding
• Numeric data => Normalization (standardization)
• Missing values => Interpolation

Internal problems

• Outliers => Detecting and smoothing
• Noisy data => De-noising
• Uniform distribution => Changing the dimension of the data space


Encoding

– {“small”, “medium”, “big”} => {0,1,2} => {-1, 0, 1}

– {“paris”, “london”, “milan”} => {{1,0,0},{0,1,0},{0,0,1}}
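Both encodings above can be sketched in a few lines. The helper names `ordinal_encode` and `one_hot_encode` are hypothetical. Ordered categories get ranks, while unordered ones get indicator vectors so that no artificial order or distance is implied.

```python
def ordinal_encode(value, ordered_categories):
    """Map an ordered category to its rank: 0, 1, 2, ..."""
    return ordered_categories.index(value)


def one_hot_encode(value, categories):
    """Map an unordered category to a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]


sizes = ["small", "medium", "big"]
cities = ["paris", "london", "milan"]

ranks = [ordinal_encode(s, sizes) for s in sizes]  # [0, 1, 2]
centered = [r - 1 for r in ranks]                  # [-1, 0, 1]
print(ranks, centered)
print(one_hot_encode("london", cities))            # [0, 1, 0]
```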

Normalization

– [-100; 300] => [-1; 1] or [0; 1]
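A minimal sketch of this min-max rescaling, assuming the original range is known in advance. The function name `min_max_scale` and the sample values are chosen here for illustration.

```python
def min_max_scale(values, lo, hi, new_lo=0.0, new_hi=1.0):
    """Linearly map values from [lo, hi] to [new_lo, new_hi]."""
    return [
        new_lo + (v - lo) / (hi - lo) * (new_hi - new_lo)
        for v in values
    ]


data = [-100, 0, 100, 300]
print(min_max_scale(data, -100, 300))             # [0.0, 0.25, 0.5, 1.0]
print(min_max_scale(data, -100, 300, -1.0, 1.0))  # [-1.0, -0.5, 0.0, 1.0]
```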

Standardization

– {mean(data) = a, std(data) = b} => {mean(data) = 0, std(data) = 1}
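Standardization can be sketched with the standard library alone. This version uses the population standard deviation and assumes the values are not all equal (otherwise the division by `b` fails). The sample data is invented.

```python
import statistics


def standardize(values):
    """Shift and scale values to mean 0 and (population) std 1."""
    a = statistics.fmean(values)   # mean(data) = a
    b = statistics.pstdev(values)  # std(data) = b
    return [(v - a) / b for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = standardize(data)
print(statistics.fmean(z), statistics.pstdev(z))
```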

Interpolation

– {0.3, 0.5, 0, NaN, -0.2} => {0.3, 0.5, 0, -0.1, -0.2}
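The example above is consistent with linear interpolation between the nearest known neighbours, which is assumed in this sketch. The function name `interpolate_missing` is hypothetical, and leading or trailing gaps are deliberately left untouched.

```python
import math


def interpolate_missing(values):
    """Fill interior NaN runs by linear interpolation between the
    nearest known neighbours; leading/trailing NaNs are left as-is."""
    result = list(values)
    known = [i for i, v in enumerate(result) if not math.isnan(v)]
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result


filled = interpolate_missing([0.3, 0.5, 0, float("nan"), -0.2])
print([round(v, 3) for v in filled])  # [0.3, 0.5, 0, -0.1, -0.2]
```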

Outliers

Noise

Uniform distribution

Pre-processing methods


Customer

Initial data

Method restrictions

Feature selection

Preliminary data analysis

Changing the dimension

Feature Selection


The way to success

Gather information

• Try to gather as much information as possible

Get some expert knowledge

• Ask what kind of results are expected
• Understand the main principles and the nature of the data

Don’t eliminate features

• That is, no a priori elimination without any research
• Don’t listen to the client and experts

Use preliminary data analysis

• Check whether the data match the problem
• Pre-processing?
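One simple form of preliminary analysis is to measure how each feature relates to the target before anyone removes it. The sketch below uses Pearson correlation on an invented data set with hypothetical feature names. It only detects linear relationships, so it is one possible check among many rather than a complete procedure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical data set: two features and a target.
features = {
    "declared_irrelevant": [1, 2, 3, 4, 5, 6],
    "declared_important": [3, 1, 4, 1, 5, 2],
}
target = [2, 4, 6, 8, 10, 12]

for name, column in features.items():
    print(name, round(pearson(column, target), 2))
```

In this invented example the feature that was declared irrelevant tracks the target almost perfectly, while the one declared important barely correlates with it. This is the situation the slide warns about.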


Questions?
