Simplifying Big Data with Apache Crunch
Transcript of Simplifying Big Data with Apache Crunch
![Page 1: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/1.jpg)
Simplifying Big Data with Apache Crunch
Micah Whitacre@mkwhit
![Page 2: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/2.jpg)
![Page 3: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/3.jpg)
![Page 4: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/4.jpg)
![Page 5: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/5.jpg)
![Page 6: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/6.jpg)
![Page 7: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/7.jpg)
![Page 8: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/8.jpg)
![Page 9: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/9.jpg)
![Page 10: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/10.jpg)
![Page 11: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/11.jpg)
Semantic Chart Search
Cloud Based EMR
Medical Alerting System
Population Health Management
![Page 12: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/12.jpg)
Problem moves from scaling architecture...
![Page 13: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/13.jpg)
Problem moves from not only scaling architecture...
To how to scale the knowledge
![Page 14: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/14.jpg)
![Page 15: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/15.jpg)
Battling the 3 V’s
![Page 16: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/16.jpg)
Battling the 3 V’s
Daily, weekly, monthly uploads
![Page 17: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/17.jpg)
Battling the 3 V’s
Daily, weekly, monthly uploads
60+ different data formats
![Page 18: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/18.jpg)
Battling the 3 V’s
Daily, weekly, monthly uploads
60+ different data formats
Constant streams for near real time
![Page 19: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/19.jpg)
Battling the 3 V’s
Daily, weekly, monthly uploads
2+ TB of streaming data daily
60+ different data formats
Constant streams for near real time
![Page 20: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/20.jpg)
Normalize Data
Apply Algorithms
Load Data for
Displays
HBaseSolrVertica
AvroCSVVerticaHBase
Population Health
![Page 21: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/21.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 22: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/22.jpg)
Mapper
Reducer
![Page 23: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/23.jpg)
Struggle to fit into single MapReduce job
![Page 24: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/24.jpg)
Struggle to fit into single MapReduce job
Integration done through persistence
![Page 25: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/25.jpg)
Struggle to fit into single MapReduce job
Integration done through persistence
Custom impls of common patterns
![Page 26: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/26.jpg)
Struggle to fit into single MapReduce job
Integration done through persistence
Custom impls of common patterns
Evolving Requirements
![Page 27: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/27.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
AnonymizeData Avro
Prep for Bulk Load HBase
![Page 28: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/28.jpg)
Easy integration between teams
Focus on processing steps
Shallow learning curve
Ability to tune for performance
![Page 29: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/29.jpg)
Apache CrunchCompose processing into pipelines
Open Source FlumeJava impl
Utilizes POJOs (hides serialization)
Transformation through fns (not job)
![Page 30: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/30.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 31: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/31.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
Processing Pipeline
![Page 32: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/32.jpg)
PipelineProgrammatic description of DAG
Supports lazy execution
MapReduce, Spark, Memory
Implementations indicate runtime
![Page 33: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/33.jpg)
Pipeline pipeline = MemPipeline.getIntance();
Pipeline pipeline = new MRPipeline(Driver.class, conf);
Pipeline pipeline = new SparkPipeline(sparkContext, “app”);
![Page 34: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/34.jpg)
SourceReads various inputs
At least one required per pipeline
Custom implementations
Creates initial collections for processing
![Page 35: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/35.jpg)
SourceSequence Files
Avro ParquetHBaseJDBCHFilesTextCSV
StringsAvroRecords
ResultsPOJOs
ProtobufsThrift
Writables
![Page 36: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/36.jpg)
pipeline.read(From.textFile(path));
![Page 37: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/37.jpg)
pipeline.read(new TextFileSource(path,ptype));
![Page 38: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/38.jpg)
PType<String> ptype = …;pipeline.read(new TextFileSource(path,ptype));
![Page 39: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/39.jpg)
PTypeHides serialization
Exposes data in native Java forms
Avro, Thrift, and Protocol Buffers
Supports composing complex types
![Page 40: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/40.jpg)
Multiple Serialization Types
Serialization Type = PTypeFamily
Can’t mix families in single type
Avro & Writable available
Can easily convert between families
![Page 41: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/41.jpg)
PType<Integer> intTypes = Writables.ints();
PType<String> stringType = Avros.strings();
PType<Person> personType = Avros.records(Person.class);
![Page 42: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/42.jpg)
PType<Pair<String, Person>> pairType = Avros.pairs(stringType, personType);
![Page 43: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/43.jpg)
PTableType<String, Person> tableType = Avros.tableOf(stringType,personType);
![Page 44: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/44.jpg)
PType<String> ptype = …;PCollection<String> strings = pipeline.read(
new TextFileSource(path, ptype));
![Page 45: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/45.jpg)
PCollectionImmutable
Not created only read or transformed
Unsorted
Represents potential data
![Page 46: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/46.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 47: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/47.jpg)
Process Reference
Data
PCollection<String> PCollection<RefData>
![Page 48: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/48.jpg)
DoFnSimple API to implement
Location for custom logic
Transforms PCollection between forms
Processes one element at a time
![Page 49: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/49.jpg)
For each item emits 0:M items
FilterFn - returns boolean
MapFn - emits 1:1
![Page 50: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/50.jpg)
DoFn API
class ExampleDoFn extends DoFn<String, RefData>{ ...} Type of Data In Type of Data Out
![Page 51: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/51.jpg)
public void process(String s, Emitter<RefData> emitter) { RefData data = …; emitter.emit(data);}
Type of Data In Type of Data Out
![Page 52: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/52.jpg)
PCollection<String> refStringsPCollection<RefData> refs = refStrings.parallelDo(fn, Avros.records(RefData.class));
![Page 53: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/53.jpg)
PCollection<String> dataStrs...PCollection<RefData> refs = dataStrs.parallelDo(diffFn, Avros.records(Data.class));
![Page 54: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/54.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 55: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/55.jpg)
Hmm now I need to join...
We need a PTable
But they don’t have a common key?
![Page 56: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/56.jpg)
PTable<K, V>Immutable & Unsorted
Variation PCollection<Pair<K, V>>
Multimap of Keys and Values
Joins, Cogroups, Group By Key
![Page 57: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/57.jpg)
class ExampleDoFn extends DoFn<String, RefData>{ ...}
![Page 58: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/58.jpg)
class ExampleDoFn extends DoFn<String, Pair<String, RefData>>{
...}
![Page 59: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/59.jpg)
PCollection<String> refStringsPTable<String, RefData> refs = refStrings.parallelDo(fn, Avros.tableOf(Avros.strings(), Avros.records(RefData.class)));
![Page 60: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/60.jpg)
PTable<String, RefData> refs…;PTable<String, Data> data…;
![Page 61: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/61.jpg)
data.join(refs);
(inner join)
![Page 62: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/62.jpg)
PTable<String, Pair<Data, RefData>> joinedData = data.join(refs);
![Page 63: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/63.jpg)
Joinsright, left, inner, outer
Mapside, BloomFilter, Sharded
Eliminates custom impls
![Page 64: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/64.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 65: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/65.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 66: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/66.jpg)
FilterFn API
class MyFilterFn extends FilterFn<...>{ ...} Type of Data In
![Page 67: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/67.jpg)
public boolean accept(... value){ return value > 3;}
![Page 68: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/68.jpg)
PCollection<Model> values = …;PCollection<Model> filtered = values.filter(new MyFilterFn());
![Page 69: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/69.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 70: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/70.jpg)
PTable<String,Model> models = …;
Keyed By PersonId
![Page 71: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/71.jpg)
PTable<String,Model> models = …;PGroupedTable<String, Model> groupedModels = models.groupByKey();
![Page 72: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/72.jpg)
PGroupedTable<K, V>Immutable & Sorted
PCollection<Pair<K, Iterable<V>>>
![Page 73: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/73.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 74: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/74.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 75: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/75.jpg)
PCollection<Person> persons = …;
![Page 76: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/76.jpg)
PCollection<Person> persons = …;pipeline.write(persons, To.avroFile(path));
![Page 77: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/77.jpg)
PCollection<Person> persons = …;pipeline.write(persons, new AvroFileTarget(path));
![Page 78: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/78.jpg)
TargetPersists PCollection
At least one required per pipeline
Custom implementations
![Page 79: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/79.jpg)
TargetSequence Files
Avro ParquetHBaseJDBCHFilesTextCSV
StringsAvroRecords
ResultsPOJOs
ProtobufsThrift
Writables
![Page 80: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/80.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
![Page 81: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/81.jpg)
Pipeline pipeline = …; ...pipeline.write(...);PipelineResult result = pipeline.done();
Execution
![Page 82: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/82.jpg)
Process Reference
Data
Process Raw
Person Data
Process Raw Data
using Reference
Filter Out Invalid Data
Group Data By Person
Create Person Record
Avro
CSV
CSV
Map Reduce Reduce
![Page 83: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/83.jpg)
Tuning
GroupingOptions/ParallelDoOptions
Scale factors
Tweak pipeline for performance
![Page 84: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/84.jpg)
Focus on the transformations
Smaller learning curve
Functionality first
Less fragility
![Page 85: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/85.jpg)
Extend pipeline for new features
Iterate with confidence
Integration through PCollections
![Page 86: Simplifying Big Data with Apache Crunch](https://reader033.fdocuments.in/reader033/viewer/2022051716/58a1abbb1a28ab32438ba0a9/html5/thumbnails/86.jpg)
Linkshttp://crunch.apache.org/
http://www.quora.com/Apache-Hadoop/What-are-the-differences-between-Crunch-and-Cascading
http://dl.acm.org/citation.cfm?id=1806596.1806638