Categorical Data Analysis in Python

Post on 16-Jul-2015


1

Categorical Data Analysis in Python

By

Jaidev Deshpande
Data Scientist, DataCulture Analytics

twitter.com/jaidevd

2

Problem: Who's likely to attend the next meetup?

● Who comes often?
● Men / Women?
● Where do you live? How far from the venue?
● Proficiency with Python (Beginner / Intermediate / Advanced)?
● Area of interest?

3

Something like...

Attendees    Features
             Attendance (%)  Gender  Pincode  Proficiency in Python  Interest           ...
attendee_1   80              M       411013   Intermediate           Web                ...
attendee_2   30              F       411040   Advanced               Test / Automation  ...
attendee_3   55              M       411001   Beginner               Scientific         ...
...          ...             ...     ...      ...                    ...                ...

● 1. Numerical features – continuous and quantitative
● 2. Categorical features – discrete and qualitative

4

Common Numerical Operations on Data

● Obviously: add, subtract, multiply, divide
● Statistical moments
● Operations in vector spaces
– Distance measures
– Slicing

5

Comparison of Operations

Numerical Data                                 Categorical Data (Strings, etc.)
Add, subtract, multiply, divide                What's the product of two strings?
Mean, variance, standard deviation             The average pincode of two areas?
Vector spaces: the very idea of 'measuring'    &%%#&$$*&!!!!

At least get some numbers!

6

One-hot Encoding

● [Apples, Oranges, Mangoes]

[0, 0, 1;
 0, 1, 0;
 1, 0, 0]

● sklearn.preprocessing.OneHotEncoder
● sklearn.feature_extraction.DictVectorizer
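A minimal sketch of one-hot encoding the three fruit labels with sklearn.preprocessing.OneHotEncoder. The variable names are illustrative. The encoder orders categories alphabetically, so the column order here need not match the matrix drawn on the slide.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three observations; the encoder expects 2-D input.
fruits = np.array([["Apples"], ["Oranges"], ["Mangoes"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
encoded = encoder.fit_transform(fruits).toarray()

print(encoder.categories_[0])  # learned categories, in column order
print(encoded)                 # one row per observation, exactly one 1 per row
```

Each row contains a single 1, in the column of the category that observation belongs to.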

7

Original Data

Attendees    Features
             Attendance (%)  Gender  Pincode        Proficiency in Python  Interest       ...
attendee_1   80              [0 1]   [1 0 0 … 0]    [0 1 0]                [1 0 0 0 0 0]  ...
attendee_2   30              [1 0]   [0 1 0 … 0]    [1 0 0]                [0 1 0 0 0 0]  ...
attendee_3   55              [0 1]   [0 0 1 … 0]    [0 0 1]                [0 0 1 0 0 0]  ...
...          ...             ...     ...            ...                    ...            ...
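An encoded table like this one can be sketched with sklearn.feature_extraction.DictVectorizer, using the three attendees from the earlier slide. The dictionary keys are illustrative names, not from the talk. Pincodes are stored as strings here on the assumption that they should be treated as categories rather than numbers.

```python
from sklearn.feature_extraction import DictVectorizer

# The three example attendees; string values are one-hot encoded,
# numeric values (attendance) are passed through unchanged.
attendees = [
    {"attendance": 80, "gender": "M", "pincode": "411013",
     "proficiency": "Intermediate", "interest": "Web"},
    {"attendance": 30, "gender": "F", "pincode": "411040",
     "proficiency": "Advanced", "interest": "Test / Automation"},
    {"attendance": 55, "gender": "M", "pincode": "411001",
     "proficiency": "Beginner", "interest": "Scientific"},
]

vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(attendees)

print(vectorizer.feature_names_)  # e.g. 'gender=M', 'pincode=411013', ...
print(X.shape)                    # rows = attendees, columns = encoded features
```

Five original features already become twelve columns for three attendees, and every new pincode or interest adds another column.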

8

Curse of Dimensionality

9

Correspondence Analysis

● Contingency tables (pandas.crosstab)

proficiency  advanced  beginner  intermediate
gender
F            1         0         0
M            0         1         1

● Different numerical measures
● Perceptual maps
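A small sketch of building such a contingency table with pandas.crosstab, using the gender and proficiency of the three example attendees. The DataFrame and column names are assumptions made for illustration.

```python
import pandas as pd

# Gender and proficiency of the three example attendees.
df = pd.DataFrame({
    "gender": ["M", "F", "M"],
    "proficiency": ["intermediate", "advanced", "beginner"],
})

# Count how often each (gender, proficiency) pair occurs.
table = pd.crosstab(df["gender"], df["proficiency"])
print(table)
```

Rows and columns are sorted by label, and each cell holds the number of attendees with that combination.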

10

Correspondence Analysis

● How are proficiencies related w.r.t. gender? (Row profiles)
● How are genders related w.r.t. proficiency? (Column profiles)
– Cosine similarity
– Correlation / Covariance
● How are they interrelated?
– Weighted chi-squared distance
● Can the dimensionality be reduced?
– Singular value decomposition / PCA
– sklearn.decomposition.PCA
– sklearn.decomposition.TruncatedSVD
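The profile and distance ideas above can be sketched with NumPy. This example uses the proficiency-by-interest counts from the sample problem in this deck. The helper chi2_distance is a hypothetical name, and weighting by column masses is the usual textbook definition of the chi-squared distance between row profiles, assumed here rather than taken from the talk.

```python
import numpy as np

# Rows: advanced, beginner, intermediate; columns: automation, scientific, web.
N = np.array([[8, 1, 7],
              [13, 9, 35],
              [7, 1, 19]], dtype=float)

# Row profiles: each row divided by its total (rows sum to 1).
row_profiles = N / N.sum(axis=1, keepdims=True)
# Column profiles: each column divided by its total (columns sum to 1).
col_profiles = N / N.sum(axis=0, keepdims=True)

# Column masses: share of all observations falling in each column.
col_masses = N.sum(axis=0) / N.sum()

def chi2_distance(p, q, masses):
    """Weighted chi-squared distance between two profiles (hypothetical helper)."""
    return np.sqrt(np.sum((p - q) ** 2 / masses))

def cosine_similarity(p, q):
    """Cosine of the angle between two profiles."""
    return p @ q / (np.linalg.norm(p) * np.linalg.norm(q))

# Compare the 'advanced' and 'beginner' row profiles.
print(chi2_distance(row_profiles[0], row_profiles[1], col_masses))
print(cosine_similarity(row_profiles[0], row_profiles[1]))
```

Dividing by the column masses gives rare categories more weight, so a difference in a thinly populated column counts for more than the same difference in a common one.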

11

Sample Problem

● Consider the proficiency and interest features from the original problem
● Fake data with 100 observations
● Contingency matrix:

              automation  scientific  web
advanced      8           1           7
beginner      13          9           35
intermediate  7           1           19
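One way to run correspondence analysis on this matrix is the standard SVD-based formulation, sketched below with plain NumPy. This is an assumption about the method, not the code behind the talk's results, and it does not use the mca package. All variable names are illustrative.

```python
import numpy as np

# Contingency matrix: rows = proficiency, columns = interest.
N = np.array([[8, 1, 7],
              [13, 9, 35],
              [7, 1, 19]], dtype=float)
n = N.sum()

P = N / n                 # correspondence matrix (relative frequencies)
r = P.sum(axis=1)         # row masses
c = P.sum(axis=0)         # column masses

# Standardized residuals: departure from independence, scaled by the masses.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# SVD of the residuals; squared singular values are the principal inertias.
U, sing, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of rows and columns (points for a perceptual map).
row_coords = (U * sing) / np.sqrt(r)[:, None]
col_coords = (Vt.T * sing) / np.sqrt(c)[:, None]

total_inertia = np.sum(sing ** 2)

# Cross-check: total inertia equals the chi-squared statistic divided by n.
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
chi2 = np.sum((N - expected) ** 2 / expected)

print(row_coords[:, :2])        # first two dimensions for the proficiency levels
print(col_coords[:, :2])        # first two dimensions for the interests
print(total_inertia, chi2 / n)
```

For a 3 x 3 table at most two dimensions carry any inertia, so plotting the first two columns of row_coords and col_coords on shared axes loses nothing.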

12

Results

13

Source and Tutorials

● http://github.com/motherbox/mca