Apache Pig as a researcher’s stepping stone

Ben O’Steen @benosteen

www.bl.uk 2

Motivation:

• (Anecdotally) Researchers are motivated by their subject.

– Tools and techniques are interesting to them if they can help further their knowledge and mastery of their chosen field.

My Problem:

• We have a lot of data.

– More than will fit on a researcher’s workstation, but not what HPC people consider Big Data™.

My Problem:



• Different problem to typical HPC:

– Ours: small compute over a series of large, messy datasets

– HPC: large compute over “small”, well-characterised input datasets

• What’s the minimum a researcher needs to learn, in order to make use of compute clouds?

What choices are there?

• Excel, while ubiquitous, has limitations, especially when dealing with semi-structured data.

• OpenRefine is a fine choice, but has its own pros and cons.

• General-purpose computing environment

– I’m biased: this is a great choice, but not an easy sell to task-focussed people.

• Tailored computing environment

– R, SciPy, MATLAB, and so on.

What about Hadoop?

• Industry backing and use.

• Open and subscription-free.

• Write once, run on any cluster

– Well, mostly.

• Clusters can be ‘spun up’ on demand from a number of providers (e.g. AWS, Azure)

• Lovely. But…

Researchers and distributed computing

• The idea of trying to teach MapReduce or related techniques to a task-focussed researcher doesn’t appeal.

• The idea of trying to teach a task-focussed researcher how to write MapReduce jobs in Java doesn’t appeal at all.

Hiding Hadoop

• Large number of projects built on top of Hadoop

– Using the Hadoop framework, but presenting a different way to utilise it

• HBase, Mahout, Hive and, of course, Pig

Why Pig?

• From the wiki:

“Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions but you can also create your own user-defined functions to do special-purpose processing.”

http://hadoop.apache.org/pig/

Pig’s Philosophy

• Pigs eat anything

• Pigs live anywhere

• Pigs are domestic animals

• Pigs fly

(from Programming Pig, by Alan Gates)

What does Pig Latin look like?

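-- Load the metadata list (tab-separated by default) and name its two fields.
raw = LOAD 'c19/metadatalist' AS (id, pubdate);
-- Keep just the two fields of interest.
dates = FOREACH raw GENERATE id AS id, pubdate AS pubdate;
-- Collect all records that share a publication date into one bag per date.
date_group = GROUP dates BY pubdate;
STORE date_group INTO 'c19/date_group';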
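
One natural next step, sketched here rather than taken from the talk, is to turn each group into a count of items per publication date. The output path 'c19/date_counts' and the field name num_items are made up for illustration.

```pig
-- Load the same metadata list as the slide's script (tab-separated by default).
raw = LOAD 'c19/metadatalist' AS (id, pubdate);
date_group = GROUP raw BY pubdate;
-- 'group' holds the grouping key; COUNT runs over the bag of rows for that key.
date_counts = FOREACH date_group GENERATE group AS pubdate, COUNT(raw) AS num_items;
-- Hypothetical output location.
STORE date_counts INTO 'c19/date_counts';
```

Each output row is then a publication date followed by the number of records carrying that date.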

Write once…

• The Pig script couldn’t care less whether:

– the dataset is 12 MB or 12 TB

– it is running on a small VM or a huge cluster

– the dataset is only a sample
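
One way to keep a single script for all of these cases is Pig's parameter substitution. The sketch below assumes the script is saved as date_group.pig and that a small test copy of the data lives under 'sample/'; both names are hypothetical, as are the parameter names input and output.

```pig
-- Small local test (hypothetical paths):
--   pig -x local -param input=sample/metadatalist -param output=sample/date_group date_group.pig
-- Full run on a cluster, using the defaults declared below:
--   pig date_group.pig
%default input 'c19/metadatalist';
%default output 'c19/date_group';

raw = LOAD '$input' AS (id, pubdate);
date_group = GROUP raw BY pubdate;
STORE date_group INTO '$output';
```

Only the command line changes between the laptop run and the cluster run; the script body stays the same.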

Some tips

• Distributed computing’s Hello World is a word-count

(a.txt is a big list of words, one per line)

-- Total number of words in a.txt (one word per line).
a = LOAD 'a.txt';
b = GROUP a ALL;
c = FOREACH b GENERATE COUNT(a) AS num_rows;
DUMP c;
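
The script above gives one overall total. A more traditional word count reports how often each distinct word occurs; a minimal sketch, assuming the same one-word-per-line a.txt, could look like this. The alias and field names are illustrative.

```pig
-- Name and type the single field so it can be used as a grouping key.
words = LOAD 'a.txt' AS (word:chararray);
by_word = GROUP words BY word;
-- One output row per distinct word, with the number of times it appears.
counts = FOREACH by_word GENERATE group AS word, COUNT(words) AS occurrences;
-- Most frequent words first.
ordered = ORDER counts BY occurrences DESC;
DUMP ordered;
```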

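• “sampled = SAMPLE raw 0.01;”

– Keyword that takes a random sample (0.01, i.e. roughly 1%) of some source data (‘raw’), rather than processing the lot. Great for testing. (The alias is ‘sampled’ because ‘sample’ itself is a reserved keyword, whatever its case.)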
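
As a rough illustration, SAMPLE can sit directly after the LOAD while a script is being developed. The sketch below reuses the 'c19/metadatalist' input from the earlier slide and simply reports how many rows the sample kept; the alias names are illustrative.

```pig
raw = LOAD 'c19/metadatalist' AS (id, pubdate);
-- Keep each row with probability 0.01, so the sample size is approximate.
sampled = SAMPLE raw 0.01;
-- Count the rows that survived the sampling.
all_rows = GROUP sampled ALL;
sample_size = FOREACH all_rows GENERATE COUNT(sampled) AS num_rows;
DUMP sample_size;
```

Once the rest of the script behaves on the sample, the later statements can be pointed back at 'raw' for the full run.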

BNB and C19thC scripts

• See https://github.com/bl-labs

Thank you
