The Researcher’s Choice of Diverse Methodologies – “Principal” Considerations
Apache Pig as a researcher’s stepping stone
www.bl.uk 2
Motivation:
• (Anecdotally) Researchers are motivated by their subject.
– Tools and techniques are interesting to them if they can help further their knowledge and mastery of their chosen field.
My Problem:
• We have a lot of data.
– More than will fit on researchers’ workstations, but not what HPC people consider Big Data™.
• A different problem from typical HPC:
– Ours: small compute over a series of large, messy datasets
– HPC: large compute over “small”, well-characterised input datasets
• What’s the minimum a researcher needs to learn in order to make use of compute clouds?
What choices are there?
• Excel, while ubiquitous, has limitations, especially when dealing with semi-structured data.
• OpenRefine is a fine choice, but has its own pros and cons.
• General-purpose computing environment
– I’m biased: this is a great choice, but not an easy sell to task-focussed people.
• Tailored computing environment
– R, SciPy, MATLAB, and so on.
What about Hadoop?
• Industry backing and use.
• Open and subscription-free.
• Write once, run on any cluster
– Well, mostly.
• Clusters can be ‘spun up’ on demand from a number of providers (e.g. AWS, Azure)
• Lovely. But…
Researchers and distributed computing
• The idea of trying to teach Map-Reduce or related techniques to a task-focussed researcher doesn’t appeal.
• The idea of trying to teach how to do Map-Reduce in Java to a task-focussed researcher doesn’t appeal at all.
Hiding Hadoop
• A large number of projects are built on top of Hadoop
– They use the Hadoop framework, but present a different way to utilise it
• HBase, Mahout, Hive and, of course, Pig
Why Pig?
• From the wiki:
“Apache Pig is a platform for analyzing large data sets. Pig's language, Pig Latin, lets you specify a sequence of data transformations such as merging data sets, filtering them, and applying functions to records or groups of records. Pig comes with many built-in functions but you can also create your own user-defined functions to do special-purpose processing.”
Pig’s Philosophy
• Pigs eat anything
• Pigs live anywhere
• Pigs are domestic animals
• Pigs fly
(from Programming Pig, by Alan Gates)
What does Pig Latin look like?
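raw = LOAD 'c19/metadatalist' AS (id, pubdate);  -- tab-separated fields by default
dates = FOREACH raw GENERATE id AS id, pubdate AS pubdate;  -- pick out the fields of interest
date_group = GROUP dates BY pubdate;  -- one group (a bag of records) per publication date
STORE date_group INTO 'c19/date_group';  -- write the grouped records back out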
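The script above stores, for each publication date, the bag of records sharing that date. A common next step is to reduce each bag to a count. The following is only a sketch of how that might look: the chararray types, the alias names and the output path 'c19/date_counts' are assumptions for illustration, not part of the original scripts.

```pig
-- Assumed schema: both fields treated as text
raw = LOAD 'c19/metadatalist' AS (id:chararray, pubdate:chararray);
date_group = GROUP raw BY pubdate;
-- After GROUP, 'group' holds the pubdate key and 'raw' is the bag of matching records
date_counts = FOREACH date_group GENERATE group AS pubdate, COUNT(raw) AS num_items;
date_counts_sorted = ORDER date_counts BY pubdate;
-- Hypothetical output location
STORE date_counts_sorted INTO 'c19/date_counts';
```

Each output row is a publication date followed by the number of items recorded against it.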
Write once…
• The Pig script couldn’t care less whether:
– the dataset is 12 MB or 12 TB
– it is running on a small VM or a huge cluster
– the dataset is only a sample
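One way to make this concrete is to leave the input and output paths out of the script and pass them in at run time using Pig's parameter substitution. The sketch below assumes a script file called group_by_date.pig and a sample directory called sample/; both names are hypothetical. The same script is then run unchanged in local mode and on a cluster.

```pig
-- group_by_date.pig (hypothetical file name)
--
-- Quick test on one machine against a small sample:
--   pig -x local -param input=sample/metadatalist -param output=sample/date_group group_by_date.pig
--
-- Full run on a Hadoop cluster:
--   pig -x mapreduce -param input=c19/metadatalist -param output=c19/date_group group_by_date.pig

raw = LOAD '$input' AS (id, pubdate);
date_group = GROUP raw BY pubdate;
STORE date_group INTO '$output';
```

Only the command line changes between the two runs; the Pig Latin stays the same.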
Some tips
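• Distributed computing’s Hello World is a word count
(a.txt is a big list of words, one per line)
a = load 'a.txt';
b = group a all;  -- gather every row into a single group
c = foreach b generate COUNT(a) as num_rows;  -- total number of words in the file
dump c;  -- Pig runs nothing until output is requested with DUMP or STORE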
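The script above counts the total number of rows (words) in a.txt. The more familiar form of word count, how often each distinct word occurs, needs only a change to the grouping. This is a sketch under the same assumption of one word per line; the alias names and the chararray type are illustrative.

```pig
-- One word per line, read as text
words = LOAD 'a.txt' AS (word:chararray);
-- One group per distinct word
by_word = GROUP words BY word;
-- 'group' is the word itself; COUNT(words) is how many times it appears
word_counts = FOREACH by_word GENERATE group AS word, COUNT(words) AS occurrences;
-- Most frequent words first
ranked = ORDER word_counts BY occurrences DESC;
DUMP ranked;
```

For a large input, STORE ranked INTO some output directory would be a better choice than DUMP, which prints every row to the console.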
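• “raw_sample = SAMPLE raw 0.01;”
– SAMPLE is a keyword that takes a random sample (0.01, i.e. roughly 1%) of some source data (‘raw’) rather than processing the lot. Great for testing.
– Because it is a reserved word, ‘sample’ itself cannot be used as the alias name.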
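As a sketch of how sampling might fit into a test run, the script below reuses the metadata file from the earlier slide and previews a few grouped results on screen. The alias names and the LIMIT of 10 are arbitrary choices for illustration.

```pig
raw = LOAD 'c19/metadatalist' AS (id, pubdate);
-- Keep roughly 1% of the rows, chosen at random
raw_sample = SAMPLE raw 0.01;
date_group = GROUP raw_sample BY pubdate;
-- Look at a handful of groups rather than the whole result
preview = LIMIT date_group 10;
DUMP preview;
```

Once the output looks right, grouping 'raw' instead of 'raw_sample' and replacing the DUMP with a STORE gives the full run.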
BNB and C19thC scripts
• See https://github.com/bl-labs
Thank you