Page 1: Welcome (back) to IST 380 !

Welcome (back) to IST 380 !

Today: the old and the new

the most traditional approach to modeling data

modeling trends from Twitter data

This picture may soon become part of

the OLD, if trends continue…

Page 2: Welcome (back) to IST 380 !

Assignments…

Homework #1 is complete! (2/5)

Getting started with R (tutorial + "quiz" + text)

Homework #2 is due tomorrow (2/12)

Pr #1: text, Chapters 6-9

Pr #2: Monty Hall challenge

Pr #3: writing a predictive model by hand…

Homework #3 is due next Tuesday (2/20)

Pr #1: text, Chapter 10

Pr #2: the envelope, please!

Pr #3: linear models for prediction

Things are heating up here!

Make sure you can submit to our submission site!

Zac & Suleng

Page 3: Welcome (back) to IST 380 !

The age of data?

I prefer my data well-aged!

Page 4: Welcome (back) to IST 380 !

R path!

Programming Skills

Subject Expertise

2

… R's toolset and its capabilities…

data collection

descriptive vs. generative vs. predictive statistics

predictions using linear regression

I predict we'll get here, but not necessarily in a straight line!…

3

1

Page 5: Welcome (back) to IST 380 !

Tweet "diffs" for a certain hashtag…

Chapter 10 introduces access to Twitter data and statistical descriptions using these data

Descriptive statistics: Twitter data

packages, library, lapply, order, diff

Page 6: Welcome (back) to IST 380 !

Some R: library

Once you have installed these packages

packages: bitops, RCurl, RJSONIO, twitteR

later: UsingR

You can ensure they're present with

library(bitops)

and so on…

Chapter 10 will have you write a function to automate this process…

Caution! Some of these may have to be installed by hand…

What if I don't have hands?!
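
One possible shape for a function that automates the install-then-load step is sketched below. The name ensure_package is hypothetical, not the name the chapter uses; it installs a package only if it is missing and then loads it.

```r
# Hypothetical helper (the name is an assumption, not the book's):
# install a package only when it is missing, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # needs a network connection and a CRAN mirror
  }
  library(pkg, character.only = TRUE)   # pkg is a string, so character.only = TRUE
}

# "stats" ships with R, so this call never needs to install anything.
ensure_package("stats")
```

Calling it once per package name, for example with lapply over a character vector, would cover the whole list above.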

Page 7: Welcome (back) to IST 380 !

Some R: style…

I have NO COMMENT about this function!

Page 8: Welcome (back) to IST 380 !

Some R: style…

better, but not ideal

Page 9: Welcome (back) to IST 380 !

Some R: style…

use variables to hold intermediate values!
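
The functions on these slides did not survive in the transcript, so here is a small stand-in of our own that shows the same two habits: a comment saying what the function does, and named variables for intermediate values. The function f_to_c is an illustrative assumption, not the one from the slides.

```r
# Convert a Fahrenheit temperature to Celsius, one named step at a time.
# (Illustrative stand-in; not the function shown on the slides.)
f_to_c <- function(temp_f) {
  offset_f <- temp_f - 32       # degrees above freezing, in Fahrenheit
  temp_c <- offset_f * 5 / 9    # rescale Fahrenheit degrees to Celsius degrees
  temp_c                        # the last expression is the return value
}

room_temp_c <- f_to_c(68)
```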

Page 10: Welcome (back) to IST 380 !

Some R: lapply and vapply

lapply(X, FUN, ...)

vapply(X, FUN, FUN.VALUE, ...)

Allow you to apply a function to every element of a list or a vector:

> L <- list(8,9,10)
> lapply( L, add1 )
[[1]]
[1] 9

[[2]]
[1] 10

[[3]]
[1] 11

> V <- 8:10
> vapply( V, add1, FUN.VALUE=42 )
[1] 9 10 11
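
The transcript never shows add1 itself, so the runnable version below assumes the obvious definition. It also uses numeric(1) as the FUN.VALUE template; the 42 on the slide plays the same role, since only its type and length matter.

```r
# add1 is not defined in the transcript; this definition is an assumption.
add1 <- function(x) x + 1

L <- list(8, 9, 10)
as_list <- lapply(L, add1)     # lapply always returns a list

V <- 8:10
# vapply checks each result against the template and returns a plain vector.
as_vector <- vapply(V, add1, FUN.VALUE = numeric(1))
```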

Page 11: Welcome (back) to IST 380 !

UTC?

coordinated universal time

since before the railroads…

red minute hand: Bristol

black minute hand: London (Greenwich)

Clock in Bristol, UK

Page 12: Welcome (back) to IST 380 !

Looking at the data…

Page 13: Welcome (back) to IST 380 !

UTC?

can be plotted as-is

take differences via as.numeric, so that "2013-02-11 20:55:03 UTC" becomes 1360616103
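
A minimal sketch of that conversion, assuming the timestamp arrives as a string: as.POSIXct parses it, and as.numeric gives seconds since 1970-01-01 00:00:00 UTC. The second timestamp is a made-up example so there is a difference to take.

```r
stamp <- as.POSIXct("2013-02-11 20:55:03", tz = "UTC")
seconds <- as.numeric(stamp)    # seconds since 1970-01-01 00:00:00 UTC

# Hypothetical later tweet, 90 seconds after the first.
later <- as.POSIXct("2013-02-11 20:56:33", tz = "UTC")
gap <- as.numeric(later) - as.numeric(stamp)
```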

Page 15: Welcome (back) to IST 380 !

Some R: order and diff

order returns the permutation of indices that sorts its input…

order(..., na.last = TRUE, decreasing = FALSE)

> V <- c(3,4,2,1)
> V
[1] 3 4 2 1
> order(V)
[1] 4 3 1 2

What do these numbers mean?

> V[order(V)]
[1] 1 2 3 4

Why not just use sort?

You can, but this lets you order anything in the same way!

diff ?
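
To make "order anything in the same way" concrete, the sketch below sorts one vector and reorders a second vector to match, then applies diff to get the gaps between neighbors. The times and authors are made-up values for illustration.

```r
# Hypothetical data: three tweet times (in seconds) and who sent each one.
times   <- c(300, 100, 250)
authors <- c("c", "a", "b")

idx <- order(times)               # positions of the smallest, next, largest time
sorted_times   <- times[idx]      # the times in increasing order
sorted_authors <- authors[idx]    # the authors, rearranged the same way

gaps <- diff(sorted_times)        # each element minus the one before it
```

sort(times) would give sorted_times, but it could not rearrange authors to match; idx can.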

Page 18: Welcome (back) to IST 380 !

Comparing tags...

#losangeles  #sanfrancisco

Which is which?

Next week: we will

quantify these differences

more carefully…

Page 20: Welcome (back) to IST 380 !

Generative statistics: rgeom, runif, rnorm, …, sample, replicate

Chapter 7 reviews repeated sampling and the resulting distribution of means

distribution of samples of state populations

Monte Carlo method: run

a process many times to

gain insights into it…
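
A small sketch of the idea using sample and replicate. The populations vector is a randomly generated stand-in, not real state data, and the sample size of 16 and the 1000 repetitions are arbitrary choices.

```r
set.seed(380)   # fixed seed so the sketch is repeatable

# Hypothetical stand-in for 51 state populations (NOT real census data).
populations <- runif(51, min = 5e5, max = 4e7)

# One trial: the mean of a random sample of 16 populations.
one_mean <- function() {
  mean(sample(populations, size = 16, replace = TRUE))
}

# Monte Carlo: run the trial many times and keep every result.
sample_means <- replicate(1000, one_mean())
```

hist(sample_means) then shows the distribution of means, which should cluster around mean(populations).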

Page 22: Welcome (back) to IST 380 !

Both envelopes hold some positive amount of money (in a check or IOU), but one of these two envelopes holds twice as much money as the other.

Should you switch or stay?

Hw3 pr2: A second Monte Carlo example:

Switch! But, then, should you switch back?

Page 23: Welcome (back) to IST 380 !

Both envelopes hold some positive amount of money (in a check or IOU), but one of these two envelopes holds twice as much money as the other.

Should you switch or stay?

Hw3 pr2: A second Monte Carlo example :

This week ~ write a

function to model this

process…

Page 24: Welcome (back) to IST 380 !

Hw3 pr2

Write a Mystery Envelope function:

ME_once <- function( amount_found=1.0, sors="switch", verbose=TRUE)

… that runs one envelope trial

… and returns the amount of $ "earned"

Another to run it N times:

ME_ntimes <- function( n=100 )

And another to run it N times:

sample_ME <- function( run_me=100 )
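
One way the first two functions might look is sketched below. It rests on assumptions the slide does not state: that sors means "switch or stay", that the other envelope is equally likely to hold half or double the amount found, and that ME_ntimes takes an extra sors argument and returns the average earned. Treat it as a starting point, not the expected solution.

```r
# One trial. ASSUMPTION: the other envelope holds half or double the
# amount found, each with probability 1/2.
ME_once <- function(amount_found = 1.0, sors = "switch", verbose = TRUE) {
  other <- amount_found * sample(c(0.5, 2.0), size = 1)
  earned <- if (sors == "switch") other else amount_found
  if (verbose) cat("found", amount_found, "-> earned", earned, "\n")
  earned
}

# n trials. ASSUMPTION: the sors argument and the averaged return value
# are additions for this sketch; the slide's signature has only n.
ME_ntimes <- function(n = 100, sors = "switch") {
  mean(replicate(n, ME_once(sors = sors, verbose = FALSE)))
}

set.seed(380)
stay_avg   <- ME_ntimes(1000, sors = "stay")
switch_avg <- ME_ntimes(1000, sors = "switch")
```

Comparing stay_avg with switch_avg under this model is exactly where the "should you switch back?" puzzle starts, so it is worth questioning the half-or-double assumption itself.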

Page 26: Welcome (back) to IST 380 !

Big Ideas:

Predictive modeling

Linear regression

The human role… !

Page 28: Welcome (back) to IST 380 !

So, what is Machine Learning?

The goal of machine learning, also known as predictive statistics/analytics, is to find a function that yields outputs for previously unseen inputs…

passenger details

function

prediction: did the passenger survive?

For Hw2, you are building this function by hand.

Page 29: Welcome (back) to IST 380 !

R is for Regression!

The oldest and (still) most popular technique for

automatically generating a model from data.

problem 3 this week…

Page 30: Welcome (back) to IST 380 !

Regression

What is it?

Page 31: Welcome (back) to IST 380 !

Regression ~ predictive modeling

this week: making an assumption of linear dependence on the

inputs

Page 32: Welcome (back) to IST 380 !

But why is it called regression?

1877: "reversion" (peas)

1885: "regression" (people)

Page 45: Welcome (back) to IST 380 !

make this sum of squared errors (residuals) as

small as possible
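
A short check of that claim, using five made-up points rather than the data from the slides: lm picks the intercept and slope, and nudging either one away from lm's answer can only make the sum of squared errors larger.

```r
# Made-up illustrative points (not the data from the slides).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Sum of squared errors for the line y = intercept + slope * x.
sse <- function(intercept, slope) {
  sum((y - (intercept + slope * x))^2)
}

fit <- lm(y ~ x)
b <- unname(coef(fit))            # b[1] is the intercept, b[2] is the slope

best       <- sse(b[1], b[2])     # SSE at the fitted line
nudged_up  <- sse(b[1] + 0.1, b[2])
nudged_rot <- sse(b[1], b[2] - 0.1)
```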

Page 47: Welcome (back) to IST 380 !

Let's look at lm1

Page 58: Welcome (back) to IST 380 !

pr3 this week: temperatures…

Page 59: Welcome (back) to IST 380 !

Temperature anomalies

Page 60: Welcome (back) to IST 380 !

The data…

deviations from the 1950-1980 global average of 14°C ~ 57.2°F

averaged (worldwide) and presented in units of 0.01°C
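
Because the values are stored in hundredths of a degree relative to a baseline, a small helper can turn them back into ordinary temperatures. The function names and the anomaly of 50 below are illustrative assumptions; the 14°C baseline is the slide's.

```r
BASELINE_C <- 14   # the slide's baseline global average, in deg C

# Anomalies are in units of 0.01 deg C relative to the baseline.
anomaly_to_celsius <- function(anomaly) BASELINE_C + anomaly / 100

celsius_to_fahrenheit <- function(temp_c) temp_c * 9 / 5 + 32

example_c  <- anomaly_to_celsius(50)          # hypothetical anomaly of 50 units
baseline_f <- celsius_to_fahrenheit(BASELINE_C)
```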

Page 61: Welcome (back) to IST 380 !

Your task…

• follow an analysis plan similar to the Galton data in the previous slides

• fit a linear model to the yearly average data and to each month's average data

• use your model to predict what the average temperature will be for 2012 and 2013

• is the linear model a reasonable one?

• we'll check (or you can…) the prediction for 2012 (but not 2013, yet)
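
The fit-then-predict steps might look like the sketch below. The data frame is synthetic and perfectly linear on purpose, so nobody mistakes it for the real record; the column names year and anomaly are assumptions about how you load the file.

```r
# Synthetic, perfectly linear placeholder data (NOT the real temperature record).
temps <- data.frame(year = 2000:2010)
temps$anomaly <- 40 + 2 * (temps$year - 2000)   # units of 0.01 deg C

fit <- lm(anomaly ~ year, data = temps)

# predict() needs a data frame whose column name matches the formula.
future <- data.frame(year = c(2012, 2013))
predicted <- predict(fit, newdata = future)
```

With real data, plotting the residuals of fit against year is one way to start on the "is the linear model a reasonable one?" question.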

Page 62: Welcome (back) to IST 380 !

Try it!

Help is available either with hw#2 (Monty Hall and Titanic using R's functions)

or hw#3 (Twitter, envelopes, and temperatures)

this evening during lab time…

Good luck with everything this week!

Page 63: Welcome (back) to IST 380 !

Lab !

Page 64: Welcome (back) to IST 380 !

The Titanic

April 15, 1912

1502 out of the 2224 passengers

died in the sinking

What characteristics did the survivors share?

Page 65: Welcome (back) to IST 380 !

The Data

There are 742 rows and 11 columns in the training data.

here are the 11 columns

Page 66: Welcome (back) to IST 380 !

Our goal

… is to write a function that takes in a row of new data and outputs whether that passenger would survive (1) or not (0).
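
In outline, such a function could look like this. The column name sex and the rule itself are assumptions for illustration; the transcript does not list the 11 columns or show the predictors on the following slides.

```r
# Hypothetical predictor: `sex` is an assumed column name, and the rule is
# only a first guess to be checked against the training data.
predict_survival <- function(row) {
  if (row$sex == "female") 1 else 0
}

# Two made-up passengers, one row each.
passenger_a <- data.frame(sex = "female", stringsAsFactors = FALSE)
passenger_b <- data.frame(sex = "male", stringsAsFactors = FALSE)

guess_a <- predict_survival(passenger_a)
guess_b <- predict_survival(passenger_b)
```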

Page 67: Welcome (back) to IST 380 !

A first predictor

Page 68: Welcome (back) to IST 380 !

A second predictor

Does the data match the famous emergency cry?

Page 69: Welcome (back) to IST 380 !

Testing our functions…
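
One simple test is the fraction of rows a predictor gets right. The four rows below are made up, and sex and survived are assumed column names; with the real training data the same two lines at the end would apply.

```r
# Toy rows made up for illustration; column names are assumptions.
toy <- data.frame(
  sex      = c("female", "male", "male", "female"),
  survived = c(1, 0, 1, 1),
  stringsAsFactors = FALSE
)

# Vectorized version of a by-hand rule: predict 1 for "female", else 0.
predict_by_sex <- function(sex) ifelse(sex == "female", 1, 0)

predictions <- predict_by_sex(toy$sex)
accuracy <- mean(predictions == toy$survived)   # fraction of correct predictions
```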

Page 74: Welcome (back) to IST 380 !

CS vs. IS and IT ?

www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf

greater integration system-wide issues

smaller details machine specifics

Page 75: Welcome (back) to IST 380 !

CS vs. IS and IT ?

Where will IS go?

Page 76: Welcome (back) to IST 380 !

CS vs. IS and IT ?

Page 77: Welcome (back) to IST 380 !

IT ?

Where will IT go?

Page 78: Welcome (back) to IST 380 !

IT ?

Page 79: Welcome (back) to IST 380 !

The bigger picture

Weeks 10-12: Objects

Week 10: classes vs. objects

Week 11: methods and data

Week 12: inheritance

Weeks 13-15: Final Projects

Week 13: final projects

Week 14: final projects

Week 15: final exam

Page 80: Welcome (back) to IST 380 !

Data?!

• Neighbor's name

• A place they consider home

• Are they working at a company now? Where?

• How many U.S. states have they visited?

• Their favorite unhealthy food… ?

• Do they have any "Data Science" (statistics, machine learning, CS) background?

Page 81: Welcome (back) to IST 380 !

state reminders…

Page 83: Welcome (back) to IST 380 !

Data!

• Neighbor's name: Zachary Dodds

• A place they consider home: Pittsburgh, PA

• Are they working at a company now? Where? Harvey Mudd

• How many U.S. states have they visited? 44

• Their favorite unhealthy food… ? M&Ms

• Do they have any "Data Science" (statistics, machine learning, CS) background? mostly CS for me…

be sure to set up your login + profile for the submission site…

This class is truly seminar-style:

we're developing expertise in this field together.