Dachis Group Pig Hackday: Pig 202

® 2011 Dachis Group. dachisgroup.com Dachis Group Las Vegas 2012 Intermediate Pig Know How Timothy Potter (Twitter: thelabdude) Pigout Hackday, Austin TX May 11, 2012

description

Slides for Pig 202 tutorial presented by Timothy Potter at DG Pig Hackday, May 11, 2012

Transcript of Dachis Group Pig Hackday: Pig 202

Page 1: Dachis Group Pig Hackday: Pig 202

® 2011 Dachis Group.

dachisgroup.com

Dachis Group Las Vegas 2012

Intermediate Pig Know How

Timothy Potter (Twitter: thelabdude)

Pigout Hackday, Austin TX

May 11, 2012

Page 2: Dachis Group Pig Hackday: Pig 202

Agenda

UFO Sightings Data Set

1. Which US city has the most UFO sightings overall?
2. What is the most common UFO shape within a 100 mile radius of your answer for #1?

Pig Mahout Example: Training 20 Newsgroups Classifier

• Loading messages using a custom loader
• Hashed Feature Vectors
• Train Logistic Regression Model
• Evaluate Model on held-out Data

Page 3: Dachis Group Pig Hackday: Pig 202

UFO Sightings

1. What US city has the most UFO sightings overall?
2. What is the most common UFO shape within a 100 mile radius of your answer for #1?

Using Two Data Sets:

• UFO sightings data set available from Infochimps
• US city / states with geo-codes available from US Census

Page 4: Dachis Group Pig Hackday: Pig 202

Load Sightings Data

19930809 19990816 Westminster, CO triangle 1 minute A white puffy cottonball appeared and then a triangle ...

20010111 20010113 Pueblo, CO fireball 30 sec Blue fireball lights up the skies of colorado and nebraska ...

20001026 20030920 Aurora, CO triangle 10 Minutes Triangular craft (two footbal fields in size)As reported to Art Bell ...

ufo_sightings = LOAD 'ufo/ufo_awesome.tsv' AS (
    sighted_at: chararray, reported_at: chararray,
    location: chararray, shape: chararray,
    duration: chararray, description: chararray
);

ufo_sightings_split_loc = FOREACH (
    FILTER ufo_sightings BY sighted_at IS NOT NULL AND location IS NOT NULL
) {
    split_city = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 1);
    split_state = REGEX_EXTRACT(TRIM(location), '([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})', 3);
    city_lc = (split_city IS NOT NULL ? LOWER(split_city) : null);
    state_lc = (split_state IS NOT NULL ? LOWER(split_state) : null);
    GENERATE city_lc AS city, state_lc AS state, ...

Pig provides functions for doing basic text munging tasks, or you can use a UDF ...
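
If the text munging moved into a UDF, the core of it might look like the plain-Java sketch below. It reuses the pattern from the Pig script and assumes find-style matching (the pattern may match anywhere in the trimmed string). The class and method names are hypothetical and are not part of the original slides or script.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits "City, ST" into lower-cased city and state.
final class LocationSplitSketch {
    // Same pattern as the Pig script, written as a Java string literal.
    private static final Pattern CITY_STATE =
        Pattern.compile("([A-Z][\\w\\s\\-\\.]*)(, )([A-Z]{2})");

    /** Returns {city, state} in lower case, or null when nothing matches. */
    static String[] splitLocation(String location) {
        if (location == null) {
            return null;
        }
        Matcher m = CITY_STATE.matcher(location.trim());
        if (!m.find()) {
            return null; // mirrors the null checks in the Pig script
        }
        return new String[] {
            m.group(1).toLowerCase(Locale.ROOT), // group 1: city
            m.group(3).toLowerCase(Locale.ROOT)  // group 3: two-letter state
        };
    }

    public static void main(String[] args) {
        String[] parts = splitLocation(" Westminster, CO ");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

Locations that do not look like "City, ST" come back as null, which matches how the Pig script leaves city and state null for those rows.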

Page 5: Dachis Group Pig Hackday: Pig 202

Load US Cities Data with geo-codes

CO 0862000 02411501 Pueblo city 138930097 2034229 53.641 0.785 38.273147 -104.612378

CO 0883835 02412237 Westminster city 81715203 5954681 31.550 2.299 39.882190 -105.064426

CO 0804000 02409757 Aurora city 400759192 1806832 154.734 0.698 39.688002 -104.689740

us_cities = LOAD 'dev/data/usa_cities_and_towns.tsv' AS (
    state: chararray, geoid: chararray,
    ansicode: chararray, name: chararray,
    ....
    latitude: double, longitude: double
);

us_cities_w_geo = FOREACH us_cities {
    city_name = SUBSTRING(LOWER(name), 0, LAST_INDEX_OF(name,' '));
    GENERATE TRIM(city_name) AS city, TRIM(LOWER(state)) AS state, latitude, longitude;
};

Use projection to select only the fields you want to work with: city, state, latitude, longitude

Page 6: Dachis Group Pig Hackday: Pig 202

What US city has the most UFO sightings overall?

Things to consider ...

1. Need to select only sightings from US cities: join sightings data with US city data.
2. Need to count sightings for each city: group results from step 1 by state/city and count.
3. Need to do a TOP to get the city with the most sightings: descending sort on count and choose the top.

Page 7: Dachis Group Pig Hackday: Pig 202

What US city has the most UFO sightings overall?

-- Inner JOIN by (state,city) to attach geo-codes to sightings
ufo_sightings_with_geo = FOREACH (
    JOIN ufo_sightings_by_city BY (state,city), us_cities_w_geo BY (state,city) USING 'replicated'
) GENERATE
    ufo_sightings_by_city::state AS state,
    ufo_sightings_by_city::city AS city,
    ufo_sightings_by_city::sighted_at AS sighted_at,
    ufo_sightings_by_city::sighted_year AS sighted_year,
    ufo_sightings_by_city::shape AS shape,
    us_cities_w_geo::latitude AS latitude,
    us_cities_w_geo::longitude AS longitude;

-- Group by (state,city) to get number of sightings for each city
grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
    GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
             COUNT($1) AS the_count;

-- Poor man's TOP
most_freq = ORDER grp_by_state_city BY the_count DESC;
top_city_state = LIMIT most_freq 1;

DUMP top_city_state;
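
For readers more at home in Java, the join, group, count and top-1 steps of this script can be mimicked in memory as a rough sketch. The class name, method names and sample keys below are made up for illustration. Each sighting is reduced to a "state,city" key, and the US city list is a plain set of such keys.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical in-memory analogue of JOIN -> GROUP/COUNT -> ORDER/LIMIT 1.
final class TopCitySketch {
    /** Counts sightings per key, keeping only keys found in knownCities. */
    static Map<String, Integer> countKnownCities(List<String> sightingKeys,
                                                 Set<String> knownCities) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : sightingKeys) {
            if (knownCities.contains(key)) {        // inner join on the key
                counts.merge(key, 1, Integer::sum); // group and count
            }
        }
        return counts;
    }

    /** Returns the key with the highest count, or null for an empty map. */
    static String topCity(Map<String, Integer> counts) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample keys, not taken from the data set.
        List<String> sightings =
            Arrays.asList("wa,seattle", "co,aurora", "wa,seattle", "zz,atlantis");
        Set<String> cities =
            new HashSet<>(Arrays.asList("wa,seattle", "co,aurora", "co,pueblo"));
        Map<String, Integer> counts = countKnownCities(sightings, cities);
        System.out.println(topCity(counts) + " " + counts);
    }
}
```

Unlike the Pig script, this runs on a single machine and holds everything in memory, so it only illustrates the logic, not the scale.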

Page 8: Dachis Group Pig Hackday: Pig 202

What US city has the most UFO sightings overall?

(seattle,wa,446,light,47.620499,-122.350876)

Seattle only averages 58 sunny days a year. Coincidence?

Maybe all the UFOs are coming to look at the Space Needle?

Page 9: Dachis Group Pig Hackday: Pig 202

Pig Explain: Pull back the covers ...

pig -x local -e 'explain -script ufo.pig'

ufo_sightings_with_geo = FOREACH (
    JOIN ufo_sightings_by_city BY (state,city), us_cities_w_geo BY (state,city) USING 'replicated'
) GENERATE
    ufo_sightings_by_city::state AS state,
    ufo_sightings_by_city::city AS city,
    ufo_sightings_by_city::sighted_at AS sighted_at,
    ufo_sightings_by_city::sighted_year AS sighted_year,
    ufo_sightings_by_city::shape AS shape,
    us_cities_w_geo::latitude AS latitude,
    us_cities_w_geo::longitude AS longitude;

grp_by_state_city = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
    GENERATE FLATTEN($0) AS (state,city,latitude,longitude),
             COUNT($1) AS the_count;

most_freq = ORDER grp_by_state_city BY the_count DESC;
top_city_state = LIMIT most_freq 1;

DUMP top_city_state;

Explain shows how the script is split into MapReduce jobs:

• Job 1 - Mapper
• Job 1 - Reducer
• Job 2 - Full Map/Reduce

Page 10: Dachis Group Pig Hackday: Pig 202

What is the most common UFO shape within a 100 mile radius of your answer for #1?

Things we need to solve this ...

1) Some way to calculate geographical distance from a geographical location (lat / lng)
2) Iterate over all cities that have sightings to get the distance from our centroid
3) Filter by distance and count shapes

Page 11: Dachis Group Pig Hackday: Pig 202

UDF: User Defined Function

Let’s build a UDF that uses the Haversine Formula to calculate the distance between two points.

See: http://en.wikipedia.org/wiki/Haversine_formula

REGISTER some_path/my-ufo-app-1.0-SNAPSHOT.jar;

DEFINE CalcGeoDistance com.dachisgroup.ufo.GeoDistance();

...

with_distance = FOREACH calc_dist {
    GENERATE city, state,
        CalcGeoDistance(from_lat, from_lng, to_lat, to_lng) AS dist_in_miles;
};

Page 12: Dachis Group Pig Hackday: Pig 202

UDF: User Defined Function

package com.dachisgroup.ufo;

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GeoDistance extends EvalFunc<Double> {

    public Double exec(Tuple input) throws IOException {
        // Any missing coordinate yields a null distance.
        if (input == null || input.size() < 4 || input.isNull(0) ||
                input.isNull(1) || input.isNull(2) || input.isNull(3)) {
            return null;
        }

        Double dist = null;
        try {
            Double fromLat = (Double)input.get(0);
            Double fromLng = (Double)input.get(1);
            Double toLat = (Double)input.get(2);
            Double toLng = (Double)input.get(3);
            dist = haversineDistanceInMiles(fromLat, toLat, fromLng, toLng);
        } catch (Exception exc) {
            // better to return null than to throw exception
        }
        return dist;
    }

    protected double haversineDistanceInMiles(double lat1, double lat2, double lon1, double lon2) {
        // details excluded for brevity – see http://www.movable-type.co.uk/scripts/latlong.html
        return dist;
    }
}
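
The slide leaves out the body of haversineDistanceInMiles. A minimal sketch of the standard Haversine formula is given below, keeping the slide's parameter order (lat1, lat2, lon1, lon2). The class name and the Earth radius constant are assumptions, and this is not necessarily the code used in the original UDF.

```java
// Hypothetical stand-alone version of the distance helper.
final class HaversineSketch {
    // Mean Earth radius in miles (assumed value; the slides do not state one).
    static final double EARTH_RADIUS_MILES = 3958.8;

    /** Great-circle distance in miles between (lat1, lon1) and (lat2, lon2), in degrees. */
    static double distanceInMiles(double lat1, double lat2, double lon1, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        // a is the squared half-chord length between the two points.
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // c is the angular distance in radians.
        double c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        return EARTH_RADIUS_MILES * c;
    }

    public static void main(String[] args) {
        // Seattle and Aurora, CO coordinates as they appear in the slides.
        double miles = distanceInMiles(47.620499, 39.688002, -122.350876, -104.689740);
        System.out.println(miles);
    }
}
```

The formula treats the Earth as a sphere, which is usually accurate enough for a 100 mile radius filter.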

Page 13: Dachis Group Pig Hackday: Pig 202

What is the most common UFO shape ...

top_city = FOREACH top_city_state GENERATE city, state, latitude AS from_lat, longitude AS from_lng;

-- Including lat / lng in the group by key to help reduce the number of records I’m crossing
sighting_cities = FOREACH (GROUP ufo_sightings_with_geo BY (state,city,latitude,longitude))
    GENERATE FLATTEN($0) AS (state,city,latitude,longitude);

-- Pig only supports equi-joins, so we need to use CROSS to get the lat / lng
-- of the two points to calculate distance using our UDF
calc_dist = FOREACH (CROSS sighting_cities, top_city) GENERATE
    sighting_cities::city AS city,
    sighting_cities::state AS state,
    sighting_cities::latitude AS to_lat,
    sighting_cities::longitude AS to_lng,
    CalcGeoDistance(top_city::from_lat, top_city::from_lng,
                    sighting_cities::latitude, sighting_cities::longitude) AS dist_in_miles;

near = FILTER calc_dist BY dist_in_miles < 100;

-- When joining, list the largest relation first and the smallest last,
-- and optimize if possible, such as using 'replicated'
shapes = FOREACH (JOIN ufo_sightings_with_geo BY (state,city), near BY (state,city) USING 'replicated')
    GENERATE ufo_sightings_with_geo::shape AS shape;

count_shapes = FOREACH (GROUP shapes BY shape) GENERATE $0 AS shape, COUNT($1) AS the_count;
sorted_counts = ORDER count_shapes BY the_count DESC;

Page 14: Dachis Group Pig Hackday: Pig 202

Visualize Results

In Pig:

fs -getmerge sorted_counts sorted_counts.txt

In R:

shapes <- read.table("sorted_counts.txt",
    header=F, sep="\t", col.names=c("shape","occurs"), stringsAsFactors=F)

barplot(c(shapes$occurs),
    main="UFO Sightings (Shapes)",
    ylab="Number of Sightings",
    ylim=c(0,500),
    cex.names=0.8,
    las=2,
    names.arg=c(shapes$shape))

Page 15: Dachis Group Pig Hackday: Pig 202

Set Logic in Pig

Use Pig’s IsEmpty function to isolate records that only occur in one of the relations, such as sightings in cities not in the US census list:

city_sightings = COGROUP ufo_sightings_by_city BY (state,city) OUTER,
                         us_cities_w_geo BY (state,city);

outside_us_sightings =
    FOREACH (FILTER city_sightings BY IsEmpty(us_cities_w_geo)) GENERATE
        FLATTEN(ufo_sightings_by_city);
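
The COGROUP plus IsEmpty pattern is effectively an anti-join: keep the records on one side whose key has no partner on the other. As a rough in-memory illustration, with a hypothetical class name and made-up keys, the same set logic looks like this in Java:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory analogue of COGROUP + IsEmpty (an anti-join).
final class AntiJoinSketch {
    /** Returns every left key, duplicates included, that is absent from rightKeys. */
    static List<String> onlyInLeft(List<String> leftKeys, Collection<String> rightKeys) {
        Set<String> right = new HashSet<>(rightKeys);
        List<String> result = new ArrayList<>();
        for (String key : leftKeys) {
            if (!right.contains(key)) { // plays the role of IsEmpty(us_cities_w_geo)
                result.add(key);        // plays the role of FLATTEN(ufo_sightings_by_city)
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up "state,city" keys, not taken from the data set.
        List<String> sightings = Arrays.asList("wa,seattle", "zz,atlantis", "zz,atlantis");
        List<String> census = Arrays.asList("wa,seattle", "co,aurora");
        System.out.println(onlyInLeft(sightings, census));
    }
}
```

Duplicates on the left side are kept on purpose, since flattening the sightings bag in Pig also returns every sighting record for an unmatched city.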

Page 16: Dachis Group Pig Hackday: Pig 202

Mahout and Pig

Example Integration: Pig-Vector

GitHub project by Ted Dunning, Mahout Committer

https://github.com/tdunning/pig-vector

Use Case:

Train Logistic Regression Model from Pig

Hello World of ML: 20 Newsgroups

Page 17: Dachis Group Pig Hackday: Pig 202

Mahout and Pig
Step 1: Load the Training Data

Load 20-newsgroups messages using custom Pig LoadFunc:

docs = LOAD '20news-bydate-train/*/*' USING
    org.apache.mahout.pig.MessageLoader()
    AS (newsgroup, id:int, subject, body);

In Java:

public class MessageLoader extends LoadFunc {

    public void setLocation(String location, Job job) throws IOException {
        // setup where we're reading data from
    }

    public InputFormat getInputFormat() throws IOException {
        return new TextInputFormat() {
            // ...
        };
    }

    public Tuple getNext() throws IOException {
        // parse message and build Tuple that matches the schema
    }
}

Page 18: Dachis Group Pig Hackday: Pig 202

Mahout and Pig
Step 2: Vectorize using Pig-Vector UDF

-- Import UDF, define vectorizing strategy and fixed size of feature vector
DEFINE encodeVector org.apache.mahout.pig.encoders.EncodeVector('100000', 'subject+body', 'group:word, article:numeric, subject:text, body:text');

vectors = FOREACH docs GENERATE newsgroup, encodeVector(*) AS v;

Result is a hashed feature vector where features are mapped to indexes in a fixed-size sparse vector (from Mahout).

Fixed-size vectors are needed to train Mahout’s SGD-based logistic regression model.
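
To make the idea of a hashed feature vector concrete, here is a small self-contained sketch of the hashing trick: each token is hashed to an index in a vector of fixed size, so the dimensionality does not depend on the vocabulary. This only illustrates the general technique. It is not Mahout's or Pig-Vector's actual encoder, the class name is hypothetical, and a dense array stands in for a sparse vector to keep the example short.

```java
import java.util.Locale;

// Hypothetical illustration of the hashing trick (not Mahout's encoder).
final class HashedVectorSketch {
    /** Encodes text as token counts in a vector with a fixed number of slots. */
    static double[] encode(String text, int size) {
        double[] vector = new double[size];
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            // floorMod keeps the index non-negative even for negative hash codes.
            int index = Math.floorMod(token.hashCode(), size);
            vector[index] += 1.0; // distinct tokens may collide; that is accepted
        }
        return vector;
    }

    public static void main(String[] args) {
        double[] v = encode("Pig meets Mahout, Pig wins", 16);
        double total = 0;
        for (double x : v) {
            total += x;
        }
        System.out.println(total); // total number of tokens counted
    }
}
```

A larger vector size, such as the 100000 used in the DEFINE statement, makes collisions between unrelated tokens less likely.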

Page 19: Dachis Group Pig Hackday: Pig 202

Mahout and Pig
Step 3: Train the Model

DEFINE train org.apache.mahout.pig.LogisticRegression('iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt soc.religion.christian talk.religion.misc');

/* put the training data in a single bag. We could train multiple models this way */
grouped = group vectors all;

/* train the actual model. The key is bogus to satisfy the sequence vector format. */
model = foreach grouped generate 1 as key, train(vectors) as model;

store model into 'pv-tmp/news_model' using PigModelStorage();

Page 20: Dachis Group Pig Hackday: Pig 202

Mahout and Pig
Step 4: Evaluate the Model

DEFINE evaluate org.apache.mahout.pig.LogisticRegressionEval('sequence=pv-tmp/news_model/part-r-00000, key=1');

test = load '20news-bydate-test/*/*' using org.apache.mahout.pig.MessageLoader()
    as (newsgroup, id:int, subject, body);

testvecs = foreach test generate newsgroup, encodeVector(*) as v;

evalvecs = foreach testvecs generate evaluate(v);

Page 21: Dachis Group Pig Hackday: Pig 202

For Slides and Pig script email me at: [email protected]

Twitter: thelabdude

Questions?