Cascading for the Impatient
Paco Nathan
Concurrent, Inc.
@pacoid
[Diagram: word-count flow. Document Collection -> Tokenize -> Scrub token -> HashJoin Left (RHS: Stop Word List via Regex token) -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]
Copyright © 2012, Concurrent, Inc.
Unstructured Data meets Enterprise Scale
why?
Cascading.org/
how?
• Business Stakeholder POV: business process management for workflow orchestration (think BPM/BPEL)
• Systems Integrator POV: system integration of heterogeneous data sources and compute platforms
• Data Scientist POV: a directed, acyclic graph (DAG) on which we can apply Amdahl's Law
• Data Architect POV: a physical plan for large-scale data flow management
• Software Architect POV: a pattern language, similar to plumbing or circuit design
• App Developer POV: API bindings for Scala, Clojure, Python, Ruby, Java
• Systems Engineer POV: a JAR file, has passed CI, available in a Maven repo
who?
where?
• business process: domain expertise, business trade-offs, operating parameters, etc.
• API language: Scala, Clojure, Python, Ruby, Java, etc. …envision whatever else runs in a JVM
• logical plan / optimize: (raw human intellect, unless…)
• physical plan: “assembler” code
• compute framework: Apache Hadoop, in-memory local mode …envision GPUs, other frameworks, etc.
• monitors, notification: Nagios, etc.
1: copy
[Diagram: Source -> Sink, with a single map (M) step]
    import java.util.Properties;

    import cascading.flow.FlowDef;
    import cascading.flow.hadoop.HadoopFlowConnector;
    import cascading.pipe.Pipe;
    import cascading.property.AppProps;
    import cascading.scheme.hadoop.TextDelimited;
    import cascading.tap.Tap;
    import cascading.tap.hadoop.Hfs;

    public class Main {
      public static void main( String[] args ) {
        String inPath = args[ 0 ];
        String outPath = args[ 1 ];

        Properties props = new Properties();
        AppProps.setApplicationJarClass( props, Main.class );
        HadoopFlowConnector flowConnector = new HadoopFlowConnector( props );

        // create the source tap
        Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

        // create the sink tap
        Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

        // specify a pipe to connect the taps
        Pipe copyPipe = new Pipe( "copy" );

        // connect the taps, pipes, etc., into a flow
        FlowDef flowDef = FlowDef.flowDef().setName( "copy" )
          .addSource( copyPipe, inTap )
          .addTailSink( copyPipe, outTap );

        // run the flow
        flowConnector.connect( flowDef ).complete();
      }
    }

1 mapper, 0 reducers, 10 lines of code
ten lines of code for a file copy …seems like a lot.
wait!
same JAR, any scale…
• Your Mom’s Laptop: MBs of data, Hadoop standalone mode, passes unit tests (or not), runtime: seconds to minutes
• Staging Cluster: GBs of data, EMR + 4 Spot Instances, CI shows red or green lights, runtime: minutes to hours
• Production Cluster: TBs of data, EMR + 50 HPC Instances, Ops monitors results, runtime: hours to days
• MegaCorp Enterprise IT: PBs of data, 1000+ node cluster, EVP calls you when the app fails, runtime: days+
2: word count
[Diagram: Document Collection -> Tokenize -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]
1 mapper, 1 reducer, 18 lines of code
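The three steps named in the word-count diagram (Tokenize, GroupBy token, Count) can be sketched in plain Java without a cluster. This is not the Cascading code behind the slide: the class name `WordCountSketch`, the method `wordCount`, and the delimiter regex are assumptions chosen for illustration, and the in-memory stream stands in for the map and reduce phases.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Tokenize each line on spaces and simple punctuation (assumed delimiters),
    // group identical tokens together, then count each group.
    static Map<String, Long> wordCount(String... lines) {
        return Arrays.stream(lines)
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\](),.]+")))
            .filter(token -> !token.isEmpty())
            .collect(Collectors.groupingBy(
                Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(
            "A rain shadow is a dry area on the lee back side of a mountainous area.");
        // Print one "token<TAB>count" pair per line, sorted by token.
        counts.forEach((token, count) -> System.out.println(token + "\t" + count));
    }
}
```

Note that this sketch is case-sensitive, so "A" and "a" are counted separately; cleaning tokens up is what the scrub step in part 3 is for.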
3: wc + scrub
[Diagram: Document Collection -> Tokenize -> Scrub token -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]
1 mapper, 1 reducer, 22+10 lines of code
4: wc + scrub + stop words
[Diagram: Document Collection -> Tokenize -> Scrub token -> HashJoin Left (RHS: Stop Word List via Regex token) -> GroupBy token -> Count -> Word Count; M and R mark the map and reduce phases]
1 mapper, 1 reducer, 28+10 lines of code
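As a rough illustration of the scrub and stop-word steps in this diagram, the sketch below cleans each token and then drops any token found in a stop list. It is plain Java, not the Cascading API: `StopWordSketch`, `scrub`, and `dropStopWords` are hypothetical names, the scrub rule (trim plus lowercase) is an assumption, and the in-memory `Set` lookup only approximates a left hash join whose unmatched rows are kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Scrub: trim whitespace and lowercase the token (assumed clean-up rule).
    static String scrub(String token) {
        return token.trim().toLowerCase(Locale.ROOT);
    }

    // Keep only scrubbed tokens that have no match in the stop-word list,
    // which is the effect of a left join followed by "keep unmatched rows".
    static List<String> dropStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String raw : tokens) {
            String token = scrub(raw);
            if (!token.isEmpty() && !stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("A", "rain", "shadow", "is", "a", "dry", "area");
        Set<String> stopWords = new HashSet<>(Arrays.asList("a", "is", "the"));
        System.out.println(dropStopWords(tokens, stopWords)); // prints [rain, shadow, dry, area]
    }
}
```

Holding the stop list in memory is reasonable only because it is small relative to the token stream, which is the usual reason to pick a hash join for the right-hand side.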
5: tf-idf
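[Diagram: TF-IDF flow. The scrubbed, stop-word-filtered token stream (Document Collection -> Tokenize -> Scrub token -> HashJoin Left with RHS Stop Word List via Regex token) feeds the Word Count sink (GroupBy token, Count) and three branches: TF (GroupBy doc_id, token; Count), D (Unique doc_id; Insert 1; SumBy doc_id), and DF (Unique token; GroupBy token; Count). The branches are combined with HashJoin and CoGroup, and ExprFunc tf-idf writes the TF-IDF sink; M and R mark the map and reduce phases]
11 mappers, 9 reducers, 65+10 lines of code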
6: tf-idf + tdd
[Diagram: the TF-IDF flow from part 5 (Word Count, TF, D, and DF branches combined with HashJoin and CoGroup, then ExprFunc tf-idf), with three additions: Assert, Failure Traps, and Checkpoint; M and R mark the map and reduce phases]
12 mappers, 9 reducers, 76+14 lines of code
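The idea behind pairing an assertion with a failure trap can be sketched in plain Java: rows that fail a check are diverted to a side output instead of aborting the run. This is only an illustration of the concept, not the Cascading API; `TrapSketch`, `assertOrTrap`, and the `doc\d+` rule for a valid doc_id are assumptions, and checkpoints are not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class TrapSketch {
    // Assumed validity rule: a doc_id looks like "doc" followed by digits.
    static final Pattern VALID_DOC_ID = Pattern.compile("doc\\d+");

    // Returns the rows that pass the assertion; rows that fail are appended
    // to the trap list so the rest of the data can still be processed.
    static List<String[]> assertOrTrap(List<String[]> rows, List<String[]> trap) {
        List<String[]> passed = new ArrayList<>();
        for (String[] row : rows) {
            boolean ok = row.length == 2
                && row[0] != null
                && VALID_DOC_ID.matcher(row[0]).matches();
            if (ok) {
                passed.add(row);
            } else {
                trap.add(row);
            }
        }
        return passed;
    }

    public static void main(String[] args) {
        // Sample rows modeled on the deck's input data (doc_id, text).
        List<String[]> rows = Arrays.asList(
            new String[] { "doc05", "Two Women. Secrets. A Broken Land. [DVD Australia]" },
            new String[] { "zoink", null });
        List<String[]> trap = new ArrayList<>();
        List<String[]> passed = assertOrTrap(rows, trap);
        System.out.println("passed=" + passed.size() + " trapped=" + trap.size());
    }
}
```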
deployed…
    elastic-mapreduce --create --name "TF-IDF" \
      --jar s3n://temp.cascading.org/impatient/part6.jar \
      --arg s3n://temp.cascading.org/impatient/rain.txt \
      --arg s3n://temp.cascading.org/impatient/out/wc \
      --arg s3n://temp.cascading.org/impatient/en.stop \
      --arg s3n://temp.cascading.org/impatient/out/tfidf \
      --arg s3n://temp.cascading.org/impatient/out/trap \
      --arg s3n://temp.cascading.org/impatient/out/check
results?
    doc_id  text
    doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
    doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
    doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
    doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
    doc05   Two Women. Secrets. A Broken Land. [DVD Australia]
    zoink   null

    doc_id  tf-idf  token
    doc02   0.9163  air
    doc05   0.9163  australia
    doc05   0.9163  broken
    doc04   0.9163  california's
    doc04   0.9163  cause
    doc02   0.9163  cloudcover
    doc04   0.9163  death
    doc04   0.9163  deserts
    doc03   0.9163  downwind
    …
    doc02   0.9163  sinking
    doc04   0.9163  such
    doc04   0.9163  valley
    doc05   0.9163  women
    doc03   0.5108  land
    doc05   0.5108  land
    doc01   0.5108  lee
    doc02   0.5108  lee
    doc03   0.5108  leeward
    doc04   0.5108  leeward
    doc01   0.4463  area
    doc02   0.2231  area
    doc03   0.2231  area
    doc01   0.2231  dry
    doc02   0.2231  dry
    doc03   0.2231  dry
    doc02   0.2231  mountain
    doc03   0.2231  mountain
    doc04   0.2231  mountain
    doc01   0.0000  rain
    doc02   0.0000  rain
    doc03   0.0000  rain
    doc04   0.0000  rain
    doc01   0.0000  shadow
    doc02   0.0000  shadow
    doc03   0.0000  shadow
    doc04   0.0000  shadow
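The scores in the results are consistent with tf-idf = tf × ln(D / (1 + df)), where tf is the count of a token within one document, df is the number of documents containing the token, and D = 5 is the number of documents with text. That formula is inferred from the numbers shown here, so treat it as an assumption rather than a quote of the deck's code. A minimal Java sketch (the class and method names are hypothetical) reproduces four of the values:

```java
import java.util.Locale;

public class TfIdfSketch {
    // Assumed formula, inferred from the results table:
    // tf-idf = tf * ln(totalDocs / (1 + df))
    static double tfIdf(long tf, long totalDocs, long df) {
        return tf * Math.log((double) totalDocs / (1.0 + df));
    }

    static String row(String docId, String token, long tf, long totalDocs, long df) {
        return String.format(Locale.ROOT, "%s\t%.4f\t%s", docId, tfIdf(tf, totalDocs, df), token);
    }

    public static void main(String[] args) {
        long totalDocs = 5; // doc01 through doc05; the "zoink" row has no text
        System.out.println(row("doc02", "air", 1, totalDocs, 1));  // 0.9163: "air" occurs in 1 document
        System.out.println(row("doc03", "land", 1, totalDocs, 2)); // 0.5108: "land" occurs in 2 documents
        System.out.println(row("doc01", "area", 2, totalDocs, 3)); // 0.4463: "area" occurs twice in doc01, in 3 documents
        System.out.println(row("doc01", "rain", 1, totalDocs, 4)); // 0.0000: "rain" occurs in 4 documents
    }
}
```

Under this formula a token that appears in all but one document, such as "rain" or "shadow", scores exactly zero, which matches the bottom rows of the table.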
comparisons?
compare similar code in Scalding and Cascalog:
sujitpal.blogspot.com/2012/08/scalding-for-impatient.html
based on: github.com/twitter/scalding/wiki
github.com/Quantisan/Impatient
based on: github.com/nathanmarz/cascalog/wiki
blog, code, wiki, gists, jars, list, DevOps products:
cascading.org/category/impatient/
github.org/Cascading/
conjars.org/
goo.gl/KQtUL
concurrentinc.com/
drill-down?