IntroSciKitLearnStatsModels

Intro to Scikit-Learn and StatsModels for the Absolute Beginner

Jennifer D. Davis, Ph.D.

July 2, 2015

“Risks, I like to say, always pay off. You learn what to do, or what not to do.”

- Dr. Jonas E Salk

Outline

• Machine learning and statistics, tools of the data scientist

• Why python?

• Popular Scikit Learn ML algorithms

• Popular StatsModels algorithms

• History of Scikit Learn and StatsModels

• A use case: Polio Rates and Vaccination in the United States

Note: many of the slides contain lots of text or notes so you don’t need to take written notes. At the same time, this is a talk for absolute beginners and so we present in a fairly non-technical manner.

Experienced audience members may find some information lacking detail or caveats. Additional information/tutorial will be available in a Jupyter notebook on github.

Why Python?

• Well developed scripting language that can also be utilized for software development & is *scalable*

• Well developed machine learning libraries backed by developers at Google among other places

• The reference interpreter (CPython) and the core numerical libraries are implemented in C/C++, so complex computations can run fast or be ported to C

• Runs on big data platforms like Spark (pySpark)

• Plays nicely with other programming languages (see Jython & Cython for porting to Java and C respectively, other methods work too)

Why Machine Learning?

• Machine learning is a subset of Artificial Intelligence which, as the name suggests, uses mathematics to mimic intelligent learning

• Machine learning takes complex data, mathematically models it (using training data) under tunable parameters, and allows for predictions or assessments of *individuals* within or compared to groups

• Machine learning includes network analysis, deep learning, probability density graphs, supervised learning, unsupervised learning, dimensionality reduction (e.g. PCA, SVD) and other techniques

Why Statistics?

• Statistics is the application of mathematics to data to build a ‘model’ of how observations fit into a ‘big picture’

• Statistical analyses often include correlations, assessments of how ‘good’ a model is based on error rates or population fits

• Statistics is an essential part of data science repertoire, but data scientists do not *rely* on statistics alone

• Examples include ANOVA, Pearson’s Correlation, ROC Curves (assesses various models), Time Series, Regressions

• Statistical techniques can be applied to Machine Learning algorithms to determine how effective, accurate or predictive the algorithm is, but they are not the only method

• Examples include: PPV, NPV, ROC Curves

How do Statistics & Machine Learning Relate to One Another?

• Statistical methods are often used to assess the performance of a machine learning algorithm, but the statistical test itself does not need data to be ‘tuned’

• Some statistical models can be utilized as machine learning algorithms (e.g. logistic, or log-odds, regression)

• While Statistics is not generally considered part of artificial intelligence, it can be used to determine the accuracy, learning rate and other parameters tied to AI & Machine Learning.

• Machine learning algorithms use training data to tune their parameters. Remember the musician whose instrument is out of tune? We don’t want that (under-fitting). And we don’t want the musician tuned only to themselves, but differently from the rest of the band; that’s over-fitting.
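A minimal sketch of under- and over-fitting, assuming made-up data (a noisy quadratic curve) and plain NumPy polynomial fits rather than any particular scikit-learn model. A straight line is too simple for the curve (under-fitting); a degree-12 polynomial can follow the noise in the training points, which typically shows up as a low training error that does not carry over to the held-out test points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy quadratic curve.
x = np.linspace(-1, 1, 40)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(scale=0.3, size=x.size)

# Hold out every other point as test data.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def fit_and_score(degree):
    """Fit a polynomial on the training data; return (train MSE, test MSE)."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse


# Degree 1 = too simple, degree 2 = matches the data, degree 12 = too flexible.
results = {degree: fit_and_score(degree) for degree in (1, 2, 12)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Comparing the training error with the test error for each degree is the basic check: a large error on both suggests under-fitting, while a small training error with a noticeably larger test error suggests over-fitting.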

The Top 5 Machine Learning Algorithms for Data Science Available in Scikit-Learn

• PageRank (Principal Eigenvector)

• AdaBoost (Ensemble Learning)

• kNN (K-nearest neighbor Classification)

• Principal Component Analysis (dimensionality reduction)

• Neural Network Models (example, Restricted Boltzmann machines)

The Top 5 Statistical Models for Data Science Available in StatsModels

• Generalized linear models (e.g. ordinary least squares regressions)

• Nonparametric estimators

• Analysis of Variance

• Time Series Analysis

• Survival Analysis

Scikit Learn: History & Development

• Project started in 2007 as a Google Summer of Code project by David Cournapeau.

• Matthieu Brucher then took it up as part of his thesis work.

• In 2010, Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort & Vincent Michel of INRIA took project leadership

• The first public release was February 1, 2010

• Since then, releases have appeared roughly every 3 months

• A great community exists, so if you’d like to contribute your own code for machine-learning algorithms contact the scikit-learn team.

StatsModels History & Development

• Statsmodels is a Python library that provides classes & functions for estimation of many statistical functions

• It is useful for conducting tests such as ANOVA, ARMA, time-series, various flavors of regressions

• Results are tested against existing statistical packages to ensure accuracy

• For those of you who are used to R, you can fit models using R-style formulas

• The modules were originally from scipy.stats written by Jonathan Taylor. It was later expanded and moved.

• As part of the Google Summer of Code 2009, statsmodels was tested, improved and released as a package. Since then a team of developers, with support from Google and AQR, has continued the development. To oversee coding practices (i.e. use of PEP-8) python.org typically reviews modules/libraries.

Use Cases: Scikit-Learn

• Classification – identify which category an object or person belongs to, e.g. spam detection or image recognition, or which of you will pay more than $40, $75 or $100 for a pair of shoes?

• Regression – predicting continuous-value attributes associated with an object, e.g. patient drug response based on other factors

• Clustering – grouping similar objects into sets, e.g. customer segmentation, grouping experimental outcomes

• Dimensionality reduction (reducing the number of variables included in ML analyses), see my github for example

• Model selection – comparing, cross-validating, choosing tuning parameters & metrics

• Preprocessing (yes, this is important!!!) – feature extraction & normalization, transforming input data such as text, into a vector or representation that can be used by a ML algorithm
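Several of these use cases fit in a few lines. The following is a minimal sketch, assuming the Iris dataset bundled with scikit-learn and an arbitrary choice of k = 5 neighbors: it chains preprocessing (scaling) with a classifier (kNN) and scores the result with cross-validation.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features (flower measurements) and labels (species) as NumPy arrays.
X, y = load_iris(return_X_y=True)

# Preprocessing (feature scaling) followed by a classifier.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# 5-fold cross-validation: fit on 4/5 of the data, score on the rest, repeat.
scores = cross_val_score(model, X, y, cv=5)
print("accuracy per fold:", scores.round(2))
print("mean accuracy:", scores.mean().round(2))
```

Putting the scaler inside the pipeline means it is re-fitted on the training part of every fold, so the held-out part never influences the preprocessing.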

Use Cases: StatsModels

• Linear regression models (I will show an example, but not the best example)

• Plotting data to assess its fit – are you over fitting or under fitting or just right?

• Discrete Choice Models – regressions for categorical outcomes (e.g. logit and probit), and other uses

• Nonparametric Statistics – e.g. alternatives to t-tests for data not normally distributed

• Generalized Linear Models – other flavors of regressions

• Robust Regression – more regressions!

• Time Series Analyses – used in Fraud Detection

• Others such as ANOVA, Kernel Density & Survival Analyses

Polio Virus

• Polio Virus (PV) is an RNA-based virus

• The first US epidemic was in 1894. During the late 1940s & early 1950s, polio crippled more than 35,000 people per year in the US

• PV is still present in the populations of some developing countries

• President Franklin D. Roosevelt, a Polio survivor, helped to found the March of Dimes. His intent was to raise funds to develop a Polio Vaccine.

• Vaccine was invented by Dr. Jonas Salk

• US has been polio-free since 1979

Health Data: Polio Rates and Vaccination in the United States

• Poliovirus is an RNA virus that causes myelitis, respiratory problems and sometimes paralysis

• Vaccination started in late 1950s & early 1960s

• Some info about the dataset

– Data begins in 1916

– Gathered by Centers for Disease Control

– Downloaded from healthdata.gov

Analysis Work Flow Polio Data I

• Hypothesis 1: Polio Rates Decreased due to Vaccination

• Take a peek at the data & check for:

– “Missing-ness”

– Number of observations and types of observations

– Perform an initial visualization

• Perform a regression analysis to determine whether the use of vaccines was correlated with an exponential drop in Polio rates
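The "peek" step can be sketched with pandas. This is only an illustration: the small table below is made up, and the column names `year`, `state` and `cases` are assumptions, not the layout of the actual healthdata.gov file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the polio table; values are made up for illustration.
df = pd.DataFrame({
    "year": [1950, 1950, 1951, 1951, 1952, 1952],
    "state": ["TX", "NY", "TX", "NY", "TX", "NY"],
    "cases": [120.0, 300.0, np.nan, 280.0, 150.0, np.nan],
})

print(df.head())            # first few rows
print(df.shape)             # (number of observations, number of columns)
print(df.dtypes)            # type of each column
missing = df.isna().sum()   # "missing-ness": missing values per column
print(missing)

# One simple aggregate for a first plot: total cases per year (NaN is skipped).
per_year = df.groupby("year")["cases"].sum()
print(per_year)
```

With matplotlib installed, `per_year.plot()` would give a first, aggregated view of the series.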

What ALL the Data Looks Like…

Assumptions based on ALL the data (aggregated data) can lead to results that are hard to interpret or misleading. This graph makes it seem that the vaccine was irrelevant, as Polio rates decreased exponentially before vaccinations started. But is that true?

Some of the Code

It’s good practice to import all the libraries and modules you will use at the top of your code file when doing ad hoc analyses. The Jupyter notebook will be provided on github.

Some more of the Code (our regression)
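The code from this slide is not part of the transcript. As a rough sketch of what an Ordinary Least Squares regression looks like in StatsModels, here is an example on synthetic data (not the polio dataset); the column names `year` and `rate` and the numbers used to generate them are assumptions for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Synthetic data (not the polio dataset): a roughly linear decline plus noise.
years = np.arange(1940, 1970)
rate = 50.0 - 1.5 * (years - 1940) + rng.normal(scale=3.0, size=years.size)
df = pd.DataFrame({"year": years, "rate": rate})

# R-style formula: model rate as a linear function of year.
results = smf.ols("rate ~ year", data=df).fit()

print(results.params)     # intercept and slope
print(results.rsquared)   # fraction of variance explained
print(results.summary())  # full table, including skew and kurtosis of residuals
```

The summary table is where fit diagnostics such as R-squared, skew and kurtosis are reported.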

Summary Outcome

• Alternate hypothesis: Rates of decreased incidence of Polio differed by state.

• Our linear regression was not a good fit using Ordinary Least Squares and the aggregated data might have been misleading

• There was significant skew and kurtosis

• Either a log-odds regression with a different distribution family chosen OR a non-parametric test would be more appropriate for this data considering skew; alternatively transforming to normal distribution can be appropriate
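Skew, kurtosis and the effect of a transformation can be checked directly. A minimal sketch, assuming synthetic log-normal data as a stand-in for right-skewed rates and using `scipy.stats`:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic right-skewed data (log-normal), standing in for skewed rates.
values = rng.lognormal(mean=2.0, sigma=1.0, size=2000)

print("skew:", stats.skew(values))
print("excess kurtosis:", stats.kurtosis(values))  # 0 for a normal distribution

# A log transform pulls in the long right tail.
logged = np.log(values)
print("skew after log:", stats.skew(logged))
print("excess kurtosis after log:", stats.kurtosis(logged))
```

A skew near 0 after the transform suggests that a model assuming normality is more reasonable on the transformed scale; a log transform only works for strictly positive values.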

Analysis Work Flow Polio Data 2:

• Hypothesis 2: Polio rates decreased at different rates depending upon area of the country

• Take a peek at the data:

– Perform an initial visualization based upon state (we are keeping things simple by choosing one state each in the north, south, east and west)

• Perform a time-series analysis to determine whether Polio rates decreased significantly between 1945 and 1965 (slightly before and slightly after vaccinations began) or whether the decrease was constant. This analysis will be available in the Jupyter notebook.

What the Data Looks Like…

The visualization of data that is not aggregated, but rather separated by state, shows a bimodal pattern (two peaks), not an exponential decline.

Insights & Future Action Points for Polio Study

• Vaccination had an effect, which created an initial dip in Polio levels not long after vaccination began.

• Although the rate of polio decreased in response to vaccinations with a moderate decline, the incidences rose again.

• Ultimately vaccination and public health measures were able to wipe out new incidences of Polio from the US, but not until 1979, decades after the vaccine was first administered

• Population rates of disease do not necessarily correlate with vaccination

• Vigilance and population-level prevention should be supplemented (not replaced) with vaccination

Example 2: K-means Clustering of Iris Dataset

• Quick example of visual analysis & K-means clustering using the canonical ‘Iris’ Dataset

• This dataset includes different examples of Iris Flowers along with their physical features

• We are taking a simple example directly from the scikit-learn library, but I will also add an example of cluster analysis for the Polio data at a later point in the Jupyter notebook within my github repository

Some of the Code
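The code from this slide is not part of the transcript. A minimal sketch of K-means on the Iris data with scikit-learn might look like the following; the choices `n_clusters=3`, `n_init=10` and `random_state=0` are assumptions for illustration, not necessarily the settings used in the talk.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data  # sepal length, sepal width, petal length, petal width (cm)

# Ask for 3 clusters; the algorithm never sees the species labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(kmeans.cluster_centers_.round(2))

# Compare clusters with the true species (cluster numbering is arbitrary).
names = iris.target_names.tolist()
for cluster in range(3):
    members = iris.target[labels == cluster]
    counts = [int((members == species).sum()) for species in range(3)]
    print(f"cluster {cluster}: {dict(zip(names, counts))}")
```

The per-cluster counts show which species end up together and which ones get mixed.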

Output for K-means Clustering

Insights: As we might have guessed there are 3 clusters for most feature combinations, and these are generally separate for each type of flower—but not always! Can you see where this isn’t true?

The End

• Thank you ObjectRocket & Rackspace for sponsoring PyLadies ATX and this talk!

• Where to find the data: www.healthdata.gov

• Where to find all of the Code: https://github.com/jddavis-100/Statistics-and-Machine-Learning/wiki/Welcome-&-Table-of-Contents

• Where to find the Jupyter Notebook: I will be providing it to Sara Safavi so contact her soon. You can also find a static copy of it on my wiki (soon).

• Where to have fun: start on 6th & make your way to Rainey…or out to Salt Lick Grill or ACL festival in Zilker Park…or…any number of awesome places in ATX!

A very simplistic Confusion Matrix

Understanding the true positives, true negatives, false positives and false negatives allows us to calculate accuracy & precision. We can also use this analysis on both the test and the training data. Other tests such as marginal error are sometimes used.
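A small worked example, assuming ten made-up labels and predictions, shows how the four counts turn into accuracy, precision (PPV) and NPV:

```python
import numpy as np

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are right
precision = tp / (tp + fp)                  # also called PPV
recall = tp / (tp + fn)                     # also called sensitivity
npv = tn / (tn + fn)                        # negative predictive value

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} NPV={npv:.2f}")
```

Here the counts are TP=3, TN=5, FP=1, FN=1, giving an accuracy of 0.80 and a precision of 0.75.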
