Text Classification in Python – using Pandas, scikit-learn, IPython Notebook and matplotlib

Jimmy Lai

r97922028 [at] ntu.edu.tw

http://tw.linkedin.com/pub/jimmy-lai/27/4a/536

2013/02/17

description

Big data analysis relies on a range of handy tools for gaining insight from data easily. In this talk, the speaker demonstrates a data mining flow for text classification built from several Python tools. The flow consists of feature extraction/selection, model training/tuning, and evaluation. The tools used in the flow are: Pandas for feature processing, scikit-learn for classification, IPython Notebook for fast sketching, and matplotlib for visualization.


Fast prototyping - IPython Notebook

• Write Python code in the browser:

– Exploit the remote server resources

– View the graphical results in web page

– Sketch code pieces as blocks

– Refer to http://www.slideshare.net/jimmy_lai/fast-data-mining-flow-prototyping-using-ipython-notebook for a fuller introduction.


Demo Code

• Demo Code: ipython_demo/text_classification_demo.ipynb in https://bitbucket.org/noahsark/slideshare

• IPython Notebook:

– Install

$ pip install ipython

– Execution (under ipython_demo dir)

$ ipython notebook --pylab=inline

– Open notebook with browser, e.g. http://127.0.0.1:8888


Machine learning classification

• $X_i = [x_1, x_2, \ldots, x_n]$, $x_n \in \mathbb{R}$

• $y_i \in \mathbb{N}$

• $\mathrm{dataset} = (X, Y)$

• $\mathrm{classifier}\ f: y_i = f(X_i)$


Text classification

• Feature Generation

• Feature Selection

• Classification Model Training

• Model Parameter Tuning


Dataset: 20 newsgroups dataset

Example article (the header fields are the structured data; the message body is the text):

From: [email protected] (zhenghao yeh)
Subject: Re: Newsgroup Split
Organization: University of Southern California, Los Angeles, CA
Lines: 18
Distribution: world
NNTP-Posting-Host: caspian.usc.edu

In article <[email protected]>, [email protected] (Chris Herringshaw) writes:
|> Concerning the proposed newsgroup split, I personally am not in favor of
|> doing this. I learn an awful lot about all aspects of graphics by reading
|> this group, from code to hardware to algorithms. I just think making 5
|> different groups out of this is a wate, and will only result in a few posts
|> a week per group. I kind of like the convenience of having one big forum
|> for discussing all aspects of graphics. Anyone else feel this way?
|> Just curious.
|>
|>
|> Daemon
|>

I agree with you. Of cause I'll try to be a daemon :-)

Yeh
USC


Dataset in sklearn

• sklearn.datasets

– Toy datasets

– Download data from http://mldata.org repository

• Data format of classification problem

– Dataset

• data: [raw_data or numerical]

• target: [int]

• target_names: [str]
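
The data format above can be inspected on any dataset that ships with scikit-learn. This is a minimal sketch that uses the bundled iris toy dataset purely as a stand-in (an assumption made here because it needs no download; it is not the dataset used in the talk). The 20 newsgroups loader returns an object with the same three fields, with raw article strings in data.

```python
from sklearn.datasets import load_iris

# Toy dataset bundled with scikit-learn (illustrative stand-in, not the talk's data).
dataset = load_iris()

print(dataset.data.shape)          # numerical feature matrix, one row per sample
print(dataset.target[:5])          # integer class label for each sample
print(list(dataset.target_names))  # class names, indexed by label

# The 20 newsgroups corpus follows the same layout, but is downloaded on first use:
#   from sklearn.datasets import fetch_20newsgroups
#   news = fetch_20newsgroups(subset="train")
```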


Feature extraction from structured data (1/2)

• Count the frequency of each header keyword and select the most frequent keywords as features: ['From', 'Subject', 'Organization', 'Distribution', 'Lines']

• E.g. From: [email protected] (where's my thing)

Subject: WHAT car is this!?

Organization: University of Maryland, College Park

Distribution: None

Lines: 15


Keyword        Count
Distribution    2549
Summary          397
Disclaimer       125
File             257
Expires          116
Subject        11612
From           11398
Keywords         943
Originator       291
Organization   10872
Lines          11317
Internet         140
To               106
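
A keyword count of this kind could be produced roughly as follows. The exact counting rule used in the talk is not shown, so this sketch assumes that a header is a capitalized `Keyword: value` line appearing before the first blank line; the two sample articles, the regular expression, and the function name are all made up for illustration.

```python
import re
from collections import Counter

# Two shortened, made-up articles in the 20 newsgroups style.
articles = [
    "From: alice@example.com\nSubject: WHAT car is this!?\nLines: 15\n\nBody text here.",
    "From: bob@example.com\nOrganization: Example University\nLines: 3\n\nNote: not a header.",
]

# Assumed header shape: a capitalized word at the start of a line, followed by ": ".
HEADER_RE = re.compile(r"^([A-Z][\w-]*): ", re.MULTILINE)

def count_header_keywords(docs):
    """Count in how many documents each header keyword appears."""
    counts = Counter()
    for doc in docs:
        header_block = doc.split("\n\n", 1)[0]  # headers end at the first blank line
        counts.update(set(HEADER_RE.findall(header_block)))
    return counts

keyword_counts = count_header_keywords(articles)
print(keyword_counts.most_common())
```

Keywords that occur in most articles, such as From and Lines here, are the natural candidates for features.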


Feature extraction from structured data (2/2)

• Separate structured data and text data

– Text data starts from “Lines:”

• Transform the token matrix into a numerical matrix with sklearn.feature_extraction.DictVectorizer

• E.g.

[{'a': 1, 'b': 1}, {'c': 1}] => [[1, 1, 0], [0, 0, 1]]
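
The mapping in the example above can be reproduced directly with DictVectorizer. This is a small sketch; the variable names are illustrative, and sparse=False is chosen only so that the result prints as a plain array.

```python
from sklearn.feature_extraction import DictVectorizer

# Each article is a dict of {token: count}; tokens missing from a dict count as 0.
token_dicts = [{"a": 1, "b": 1}, {"c": 1}]

vectorizer = DictVectorizer(sparse=False)  # dense output for easy printing
matrix = vectorizer.fit_transform(token_dicts)

print(vectorizer.feature_names_)  # column order of the matrix
print(matrix)
```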


Text Feature extraction in sklearn

• sklearn.feature_extraction.text

• CountVectorizer

– Transform articles into token-count matrix

• TfidfVectorizer

– Transform articles into token-TFIDF matrix

• Usage:

– fit(): construct token dictionary given dataset

– transform(): generate numerical matrix
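
The fit() and transform() usage above can be sketched on a two-sentence corpus. The sample sentences are made up for illustration, and both vectorizers run with their default settings.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

articles = ["the cat sat", "the dog sat on the cat"]  # made-up sample corpus

count_vec = CountVectorizer()
count_vec.fit(articles)                 # construct the token dictionary
counts = count_vec.transform(articles)  # sparse token-count matrix

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(articles)  # fit() and transform() in one call

print(sorted(count_vec.vocabulary_))  # tokens, in column order
print(counts.toarray())
print(tfidf.shape)
```

Both matrices have one row per article and one column per token; only the cell values differ (raw counts versus TF-IDF weights).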


Text Feature extraction

• Analyzer

– Preprocessor: str -> str

• Default: lowercase

• Extra: strip_accents – handle unicode chars

– Tokenizer: str -> [str]

• Default: re.findall(r"(?u)\b\w\w+\b", string)

– Analyzer: str -> [str]

1. Call preprocessor and tokenizer

2. Filter stopwords

3. Generate n-gram tokens
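
The three analyzer steps can be written out by hand to see what each one contributes. This is a standard-library sketch that mirrors the steps listed above; it is not scikit-learn's implementation, and the stop word list and function names are hypothetical.

```python
import re

STOP_WORDS = {"the", "of", "and"}        # tiny illustrative list, not sklearn's
TOKEN_RE = re.compile(r"(?u)\b\w\w+\b")  # default token pattern: words of 2+ characters

def preprocess(text):
    """Preprocessor: str -> str (lowercase only)."""
    return text.lower()

def tokenize(text):
    """Tokenizer: str -> [str]."""
    return TOKEN_RE.findall(text)

def analyze(text, max_n=2):
    """Analyzer: str -> [str], producing 1-grams up to max_n-grams."""
    # 1. Call preprocessor and tokenizer; 2. filter stop words.
    tokens = [t for t in tokenize(preprocess(text)) if t not in STOP_WORDS]
    # 3. Generate n-gram tokens by joining runs of n consecutive tokens.
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

result = analyze("The Quick brown fox")
print(result)
```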


Feature Selection

• Decrease the number of features:

– Reduce the resource usage for faster learning

– Remove the most common tokens and the rarest tokens (words that carry little information):

• Parameters for Vectorizer:

– max_df

– min_df

– max_features
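
The effect of max_df and min_df can be seen on a toy corpus. The sketch below uses made-up articles and threshold values chosen only to make the filtering visible: a float max_df is a proportion of documents, and an integer min_df is an absolute document count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus: "apple" is in every article, "durian" in only one.
articles = ["apple banana", "apple cherry", "apple banana cherry", "apple durian"]

vectorizer = CountVectorizer(
    max_df=0.9,  # drop tokens found in more than 90% of the articles
    min_df=2,    # drop tokens found in fewer than 2 articles
)
vectorizer.fit(articles)

kept = sorted(vectorizer.vocabulary_)
print(kept)
```

max_features would additionally cap the vocabulary at the given number of most frequent tokens.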


Classification Model Training

• Common classifiers in sklearn:

– sklearn.linear_model

– sklearn.svm

• Usage:

– fit(X, Y): train the model

– predict(X): get predicted Y
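
Every classifier follows the same fit/predict pattern, so models can be swapped without changing the surrounding code. The sketch below assumes two specific classifiers, one from each module named above, and a made-up, well-separated toy dataset; the talk does not say which classifiers were used.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Made-up training data: two clearly separated clusters.
X_train = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
Y_train = [0, 0, 1, 1]
X_test = [[0.0, 0.5], [5.0, 5.5]]

predictions = {}
for model in (LogisticRegression(), LinearSVC()):
    model.fit(X_train, Y_train)                                   # train the model
    predictions[type(model).__name__] = model.predict(X_test).tolist()  # predicted Y

print(predictions)
```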


Cross Validation

• When tuning the parameters of a model, let each article serve alternately as training and testing data, so that the parameters are not fitted to some specific articles.

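– from sklearn.model_selection import KFold (in older scikit-learn releases: from sklearn.cross_validation import KFold)

– for train_index, test_index in KFold(n_splits=2).split(X): (X holds 10 samples; with the older module: KFold(10, 2))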

• train_index = [5 6 7 8 9]

• test_index = [0 1 2 3 4]
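
The split above can be reproduced with current scikit-learn, where KFold lives in sklearn.model_selection rather than sklearn.cross_validation. This is a minimal sketch on ten placeholder samples; without shuffling, the folds are contiguous blocks of indices.

```python
from sklearn.model_selection import KFold

samples = list(range(10))  # placeholder for 10 articles

# Each fold yields the indices to train on and the indices to test on.
folds = [
    (train_index.tolist(), test_index.tolist())
    for train_index, test_index in KFold(n_splits=2).split(samples)
]

for train_index, test_index in folds:
    print("train:", train_index, "test:", test_index)
```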


Performance Evaluation

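• $\mathrm{precision} = \frac{tp}{tp + fp}$

• $\mathrm{recall} = \frac{tp}{tp + fn}$

• $F_1\ \mathrm{score} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$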

• sklearn.metrics

– precision_score

– recall_score

– f1_score
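
The three metric functions can be checked against the formulas on a small hand-countable case. The labels below are made up so that tp = 2, fp = 1 and fn = 1, which makes all three scores equal to 2/3.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up binary labels: tp = 2, fp = 1, fn = 1 for the positive class (1).
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # 2 * p * r / (p + r)

print(precision, recall, f1)
```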


Source: http://en.wikipedia.org/wiki/Precision_and_recall


Visualization

1. Matplotlib

2. plot() function of Series, DataFrame
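
As a sketch of the second option, a pandas Series can be drawn as a bar chart with a single plot() call. The values are five of the header keyword counts reported earlier in the talk; the chart title, axis label, and the Agg backend are choices made here for illustration (in a notebook, inline plotting replaces the backend line).

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the example runs outside a notebook

import pandas as pd

# Header keyword counts taken from the feature extraction slide.
keyword_counts = pd.Series(
    {"Subject": 11612, "From": 11398, "Lines": 11317,
     "Organization": 10872, "Distribution": 2549}
)

# Series.plot() delegates to matplotlib and returns the Axes it drew on.
ax = keyword_counts.plot(kind="bar", title="Header keyword counts")
ax.set_ylabel("Count")
```

Outside a notebook, ax.figure.savefig() with a file name writes the chart to disk.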


Experiment Result

• Future works:

– Feature selection by statistics or dimension reduction

– Parameter tuning

– Ensemble models
