Welcome (back) to IST 380 !

Post on 05-Jan-2016

Today: the old and the new

the most traditional approach to modeling data

modeling trends from Twitter data

This picture may soon become part of the OLD, if trends continue…

Assignments…

Homework #1 is complete! (2/5)

Getting started with R (tutorial + "quiz" + text)

Homework #2 is due tomorrow (2/12)

Pr #1: text, Chapters 6-9

Pr #2: Monty Hall challenge

Pr #3: writing a predictive model by hand…

Homework #3 is due next Tuesday (2/20)

Pr #1: text, Chapter 10

Pr #2: the envelope, please!

Pr #3: linear models for prediction

Things are heating up here!

Make sure you can submit to our submission site!

Zac & Suleng

The age of data?

I prefer my data well-aged!

R path!

Programming Skills

Subject Expertise

… R's toolset and its capabilities…

data collection

descriptive vs. generative vs. predictive statistics

predictions using linear regression

I predict we'll get here, but not necessarily in a straight line!…

Tweet "diffs" for a certain hashtag…

Chapter 10 introduces access to Twitter data and statistical descriptions using these data

Descriptive statistics: Twitter data

packages, library, lapply, order, diff

Some R: library

Once you have installed these packages

packages: bitops, RCurl, RJSONIO, twitteR

later: UsingR

You can ensure they're present with

library(bitops)

and so on…

Chapter 10 will have you write a function to automate this process…

Caution! Some of these may have to be installed by hand…

What if I don't have hands?!
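A function that automates the install-then-load routine might look like the sketch below. The name ensure_package is hypothetical (it is not the textbook's function), and the sketch assumes a CRAN mirror is already configured if an install is ever needed. The demo call uses stats, which ships with R, so nothing is downloaded.

```r
# Hypothetical helper: load a package, installing it first only if it is missing.
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (instead of stopping) when pkg is absent
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # assumes a CRAN mirror is configured
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" is part of base R, so this call never needs to install anything.
ensure_package("stats")
```

The same call could then be repeated for each package name listed above.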

Some R: style…

I have NO COMMENT about this function!

Some R: style…

better, but not ideal

Some R: style…

use variables to hold intermediate values!
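The functions shown on these slides did not survive in the transcript, so here is a stand-in sketch of the same advice: comment the function and give each intermediate value its own name. The function mean_gap and its input are made up for illustration.

```r
# Hypothetical example: average gap between events, given their times in seconds.
mean_gap <- function(times) {
  sorted_times <- sort(times)    # put the times in increasing order
  gaps <- diff(sorted_times)     # difference between each pair of neighbors
  average_gap <- mean(gaps)      # one number summarizing the gaps
  return(average_gap)
}

mean_gap(c(30, 10, 20, 60))
```

Each named step can be printed or checked on its own, which is harder to do with a single nested expression such as mean(diff(sort(times))).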

Some R: lapply and vapply

Allow you to apply a function to every element of a list or a vector:

lapply(X, FUN, ...)

vapply(X, FUN, FUN.VALUE, ...)

> L <- list(8,9,10)
> lapply( L, add1 )
[[1]]
[1] 9

[[2]]
[1] 10

[[3]]
[1] 11

> V <- 8:10
> vapply( V, add1, FUN.VALUE=42 )
[1] 9 10 11
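The transcript never shows add1 itself. Assuming it simply adds one to its argument, the session above can be reproduced with this sketch:

```r
# Assumed definition of add1; the slide's own version is not in the transcript.
add1 <- function(x) {
  return(x + 1)
}

# lapply always returns a list, one element per input element.
L <- list(8, 9, 10)
as_list <- lapply(L, add1)

# vapply returns a plain vector; FUN.VALUE is a template for each result
# (here: a single number), and vapply stops if a result does not match it.
V <- 8:10
as_vector <- vapply(V, add1, FUN.VALUE = numeric(1))
```

The slide's FUN.VALUE=42 works the same way: only the type and length of the template matter, not its value.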

UTC?

coordinated universal time

Clock in Bristol, UK

red minute hand: Bristol

black minute hand: London (Greenwich)

since before the railroads…

Looking at the data…

UTC?

can be plotted as-is

take differences via as.numeric, so that "2013-02-11 20:55:03 UTC" becomes 1360616103
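A small sketch of that conversion, using the timestamp from the slide plus a second, made-up timestamp so there is something to take a difference with:

```r
# The timestamp from the slide, parsed as UTC.
first_tweet <- as.POSIXct("2013-02-11 20:55:03", tz = "UTC")

# A made-up later timestamp, for illustration only.
second_tweet <- as.POSIXct("2013-02-11 20:56:10", tz = "UTC")

# as.numeric gives seconds since 1970-01-01 00:00:00 UTC.
first_seconds <- as.numeric(first_tweet)

# The gap between the two tweets, in seconds.
gap_seconds <- as.numeric(second_tweet) - first_seconds
```

Here first_seconds is 1360616103, matching the slide, and gap_seconds is 67.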

Some R: order and diff

order returns a permutation of its input…

order(..., na.last = TRUE, decreasing = FALSE)

> V <- c(3,4,2,1)
> V
[1] 3 4 2 1
> order(V)
[1] 4 3 1 2
> V[order(V)]
[1] 1 2 3 4

What do these numbers mean?

Why not just use sort?

You can, but this lets you order anything in the same way!

diff ?
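A sketch of both functions on small inputs. The labels vector and the numbers passed to diff are made up for illustration:

```r
V <- c(3, 4, 2, 1)

# order(V) gives the positions of V's elements from smallest to largest:
# the smallest value sits at position 4, the next at position 3, and so on.
idx <- order(V)
sorted_V <- V[idx]

# The same index vector reorders any parallel vector in the same way,
# which sort() alone cannot do.
labels <- c("c", "d", "b", "a")   # hypothetical labels, one per element of V
labels_sorted <- labels[idx]

# diff returns the differences between consecutive elements.
gaps <- diff(c(1, 4, 9, 16))
```

Here idx is 4 3 1 2, labels_sorted is "a" "b" "c" "d", and gaps is 3 5 7.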

Comparing tags...

#losangeles

#sanfrancisco

Which is which?

Next week: we will quantify these differences more carefully…

Generative statistics

rgeom, runif, rnorm, …, sample, replicate

Chapter 7 reviews repeated sampling and the resulting distribution of means

distribution of samples of state populations

Monte Carlo method: run a process many times to gain insights into it…
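A minimal sketch of repeated sampling with sample and replicate. The populations vector is randomly generated stand-in data, not the real state populations used in the text:

```r
set.seed(380)   # make the random draws repeatable

# Stand-in data: 50 made-up "state populations" (NOT the real figures).
populations <- runif(50, min = 5e5, max = 4e7)

# One trial: draw 5 states at random and average their populations.
one_sample_mean <- function() {
  mean(sample(populations, size = 5, replace = TRUE))
}

# Monte Carlo: repeat the trial many times and collect the means.
sample_means <- replicate(1000, one_sample_mean())

# The means cluster around the overall mean and vary less than the raw data.
mean(sample_means)
sd(sample_means)
```

Calling hist(sample_means) would show the distribution of means that the chapter discusses.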

Hw3 pr2: A second Monte Carlo example:

Both envelopes hold some positive amount of money (in a check or IOU), but one of these two envelopes holds twice as much money as the other.

Should you switch or stay?

Switch! But then, should you switch back?

This week ~ write a function to model this process…

Hw3 pr2

Write a Mystery Envelope function:

ME_once <- function( amount_found=1.0, sors="switch", verbose=TRUE)

… that runs one envelope trial

… and returns the amount of $ "earned"

Another to run it N times:

ME_ntimes <- function( n=100 )

And another to run it N times:

sample_ME <- function( run_me=100 )
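One possible shape for the first two functions, as a sketch rather than a reference solution. The signatures come from the slide, but the bodies are assumptions. In particular, the sketch assumes the other envelope is equally likely to hold half or double the amount found, and that ME_ntimes reports the average earned when always switching.

```r
# One envelope trial. sors is "switch" or "stay".
# Assumption: the other envelope holds half or double amount_found, 50/50.
ME_once <- function(amount_found = 1.0, sors = "switch", verbose = TRUE) {
  multiplier <- sample(c(0.5, 2), size = 1)
  other_amount <- amount_found * multiplier
  if (sors == "switch") {
    earned <- other_amount
  } else {
    earned <- amount_found
  }
  if (verbose) {
    cat("Found", amount_found, "and chose to", sors, "-> earned", earned, "\n")
  }
  return(earned)
}

# Run n switching trials and return the average amount earned (an assumed
# summary; the assignment may ask for something different).
ME_ntimes <- function(n = 100) {
  results <- replicate(n, ME_once(verbose = FALSE))
  return(mean(results))
}

ME_ntimes(1000)
```

Under the 50/50 assumption the long-run average for switching is 0.5 × 0.5 + 0.5 × 2 = 1.25 times the amount found, which is exactly what makes "should you switch back?" puzzling.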

Big Ideas:

Predictive modeling

Linear regression

The human role… !

So, what is Machine Learning?

The goal of machine learning, also known as predictive statistics/analytics, is to find a function that yields outputs for previously-unseen inputs…

passenger details → function → prediction: did the passenger survive?

For Hw2, you are building this function by hand.

R is for Regression!

The oldest and (still) most popular technique for automatically generating a model from data.

problem 3 this week…

Regression

What is it?

Regression ~ predictive modeling

this week: making an assumption of linear dependence on the inputs

But why is it called regression?

1877: "reversion" (peas)

1885: "regression" (people)

make this sum of squared errors (residuals) as small as possible

Let's look at lm1
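The slide's own lm1 and its data are not in the transcript. The sketch below builds a comparable model from made-up parent and child heights (not Galton's measurements) and checks that the fitted line really does have a smaller sum of squared errors than a slightly different line.

```r
set.seed(1885)   # make the random draws repeatable

# Made-up heights in inches; stand-ins for the data on the slide.
parent <- runif(100, min = 62, max = 74)
child <- 24 + 0.65 * parent + rnorm(100, sd = 2)

# Fit child height as a linear function of parent height.
lm1 <- lm(child ~ parent)

# Sum of squared errors (residuals) for any candidate line.
sse <- function(intercept, slope) {
  predicted <- intercept + slope * parent
  return(sum((child - predicted)^2))
}

best <- unname(coef(lm1))                      # fitted intercept and slope
sse_best <- sse(best[1], best[2])              # error of the fitted line
sse_other <- sse(best[1], best[2] + 0.05)      # error of a nudged line
```

summary(lm1) prints the coefficients and fit statistics; sse_best comes out smaller than sse_other because lm chooses the line that minimizes this sum.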

pr3 this week: temperatures…

Temperature anomalies

The data…

deviations from the 1950-1980 global average of 14°C ~ 57.2°F

averaged (worldwide) and presented in units of 0.01°C

Your task…

• follow an analysis plan similar to the Galton data in the previous slides

• fit a linear model to the yearly average data and to each month's average data

• use your model to predict what the average temperature will be for 2012 and 2013

• is the linear model a reasonable one?

• we'll check (or you can…) the prediction for 2012 (but not 2013, yet)
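The mechanics of the fit-and-predict steps might look like the sketch below. The anomaly values are randomly generated placeholders (in units of 0.01°C), not the real temperature record, and the column names year and anomaly are assumptions.

```r
set.seed(2013)   # make the random draws repeatable

# Placeholder data: a made-up upward trend plus noise, in units of 0.01 °C.
year <- 1880:2011
anomaly <- -30 + 0.6 * (year - 1880) + rnorm(length(year), sd = 10)
temps <- data.frame(year = year, anomaly = anomaly)

# Fit a linear model of anomaly against year.
fit <- lm(anomaly ~ year, data = temps)

# Predict the anomaly for the two requested years.
new_years <- data.frame(year = c(2012, 2013))
predicted_anomaly <- unname(predict(fit, newdata = new_years))

# Convert back to a temperature: the 14 °C baseline plus anomaly / 100.
predicted_celsius <- 14 + predicted_anomaly / 100
```

Plotting temps with plot(year, anomaly) and adding abline(fit) is one way to judge by eye whether a straight line is a reasonable model. The same pattern would apply to each month's column.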

Try it!

Help is available either with hw#2 (Monty Hall and Titanic using R's functions) or hw#3 (Twitter, envelopes, and temperatures) this evening during lab time…

Good luck with everything this week!

Lab !

The Titanic

April 15, 1912

1502 out of the 2224 passengers died in the sinking

What characteristics did the survivors share?

The Data

There are 742 rows and 11 columns in the training data.

here are the 11 columns

Our goal

… is to write a function that takes in a row of new data and outputs whether that passenger would survive (1) or not (0).

A first predictor

A second predictor

Does the data match the famous emergency cry?

Testing our functions…
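The predictors on these slides did not survive in the transcript. As a stand-in, here is a sketch of a hand-written predictor and a way to test it. The five rows are made up, and the column names sex, age and survived are assumptions about the training data, which may also contain missing ages that this sketch does not handle.

```r
# Made-up rows for illustration; NOT taken from the real training data.
passengers <- data.frame(
  sex = c("female", "male", "male", "female", "male"),
  age = c(29, 40, 8, 35, 54),
  survived = c(1, 0, 1, 0, 0)
)

# A hand-written rule in the spirit of "women and children first":
# predict survival (1) for females and for anyone under 12, else 0.
predict_survival <- function(row) {
  if (row$sex == "female" || row$age < 12) {
    return(1)
  } else {
    return(0)
  }
}

# Apply the rule to every row, then compare with what actually happened.
predictions <- vapply(
  seq_len(nrow(passengers)),
  function(i) predict_survival(passengers[i, ]),
  FUN.VALUE = numeric(1)
)
accuracy <- mean(predictions == passengers$survived)
```

On these five made-up rows the rule gets four right, so accuracy is 0.8; the real training data will give a different figure.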

CS vs. IS and IT ?

www.acm.org/education/curric_vols/CC2005_Final_Report2.pdf

greater integration system-wide issues

smaller details machine specifics

Where will IS go?

Where will IT go?

The bigger picture

Weeks 10-12: Objects

Week 10: classes vs. objects

Week 11: methods and data

Week 12: inheritance

Weeks 13-15: Final Projects

Week 13: final projects

Week 14: final projects

Week 15: final exam

Data?!

• Neighbor's name

• A place they consider home

• Are they working at a company now? Where?

• How many U.S. states have they visited?

• Their favorite unhealthy food… ?

• Do they have any "Data Science" (statistics, machine learning, CS) background?

state reminders…

Data!

• Neighbor's name: Zachary Dodds

• A place they consider home: Pittsburgh, PA

• Are they working at a company now? Where? Harvey Mudd

• How many U.S. states have they visited? 44

• Their favorite unhealthy food… ? M&Ms

• Do they have any "Data Science" (statistics, machine learning, CS) background? mostly CS for me…

be sure to set up your login + profile for the submission site…

This class is truly seminar-style: we're developing expertise in this field together.