Rapid Development of Big Data applications using Spring for Apache Hadoop
Spring for Apache Hadoop
By Zenyk Matchyshyn
Agenda
• Goals of the project
• Hadoop Introduction
• High level support
• Workflows
• Scripting & Migration
• Alternatives
• Testing & Related
Big Data – Why?
Because of Terabytes and Petabytes:
• Smart meter analysis
• Genome processing
• Sentiment & social media analysis
• Network capacity trending & management
• Ad targeting
• Fraud detection
Goals
• Provide a programmatic model to work with the Hadoop ecosystem
• Simplify client library usage
• Provide Spring-friendly wrappers
• Enable real-world usage as a part of Spring Batch & Spring Integration
• Leverage Spring features
Supported distros
• Apache Hadoop 1.2.1 / 2.0.6 / 2.2.0
• Cloudera CDH4
• Hortonworks HDP 1.3
• Pivotal HD 1.0 / 1.1
HADOOP INTRODUCTION
The Hadoop stack: HDFS (distributed storage) and Hadoop Map/Reduce (processing) at the core, with HBase, Pig, and Hive layered on top.
Hadoop basics

Word count through the Split → Map → Shuffle → Reduce phases:

Split:   "Dog ate the bone" / "Cat ate the fish"
Map:     (Dog, 1) (ate, 1) (the, 1) (bone, 1) (Cat, 1) (ate, 1) (the, 1) (fish, 1)
Shuffle: (Dog, 1) (ate, {1, 1}) (the, {1, 1}) (bone, 1) (Cat, 1) (fish, 1)
Reduce:  (Dog, 1) (ate, 2) (the, 2) (bone, 1) (Cat, 1) (fish, 1)
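Expressed as code, the flow above becomes a Mapper/Reducer pair. A minimal sketch (class names such as WordCountMapper are illustrative, not from the slides); a <hdp:job> definition would point its mapper and reducer attributes at classes like these.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every token in the input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the grouped counts for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}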
Configuration

<!-- ... -->
<context:property-placeholder location="hadoop.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
    mapred.job.tracker=${hd.jt}
</hdp:configuration>
<!-- ... -->
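A minimal sketch of bootstrapping this configuration from plain Java, assuming the XML above lives in a file named hadoop-context.xml (illustrative) and that <hdp:configuration> registers its bean under the default name hadoopConfiguration:

import org.apache.hadoop.conf.Configuration;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class Bootstrap {
    public static void main(String[] args) {
        // Loading the context wires up the <hdp:configuration> element;
        // the file name and bean name below are illustrative assumptions.
        ClassPathXmlApplicationContext ctx =
                new ClassPathXmlApplicationContext("hadoop-context.xml");
        Configuration conf = ctx.getBean("hadoopConfiguration", Configuration.class);
        System.out.println("fs.default.name = " + conf.get("fs.default.name"));
        ctx.close();
    }
}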
Job definition

<hdp:job id="hadoopJob"
    input-path="${wordcount.input.path}"
    output-path="${wordcount.output.path}"
    libs="file:${app.repo}/supporting-lib-*.jar"
    mapper="org.company.Mapper"
    reducer="org.company.Reducer"/>
For comparison, the same job wired up through the plain Hadoop API:

Configuration conf = new Configuration();
Job job = new Job(conf, "hadoopJob");
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setMapperClass(Mapper.class);
job.setReducerClass(Reducer.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);
Job Execution

<hdp:job-runner id="runner" run-at-startup="true"
    pre-action="someScript"
    post-action="someOtherScript"
    job-ref="hadoopJob"/>

• Basic
• Scheduled – TaskScheduler, Quartz (see the sketch below)
• Custom
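For the scheduled option, one possibility is a Spring-scheduled component that triggers the runner. A minimal sketch, assuming the runner bean exposes the Callable-style call() method used for TaskScheduler integration and that annotation-driven scheduling is enabled in the context; the class name and cron expression are illustrative.

import java.util.concurrent.Callable;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class NightlyJobTrigger {

    // The <hdp:job-runner id="runner"> bean from the configuration above.
    @Autowired
    @Qualifier("runner")
    private Callable<Void> runner;

    // Run the Hadoop job every night at 2 AM (illustrative schedule).
    @Scheduled(cron = "0 0 2 * * *")
    public void runJob() throws Exception {
        runner.call();
    }
}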
HIGH LEVEL TOOLS
Solutions
• HBase
• Hive
• Pig
• Cascading

Simplifies
• Thread safety
• DAO friendliness, wrappers and basic mappers
• Simple connection interfaces
• Runners, Template and callback methods
• Common scenario simplifications
• Scripting support
Example - Template
template.execute("MyTable", new TableCallback<Object>() {
    @Override
    public Object doInTable(HTable table) throws Throwable {
        Put p = new Put(Bytes.toBytes("SomeRow"));
        p.add(Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier"),
              Bytes.toBytes("AValue"));
        table.put(p);
        return null;
    }
});

<hdp:hbase-configuration/>

<bean id="hbaseTemplate" class="org.springframework.data.hadoop.hbase.HbaseTemplate"
    p:configuration-ref="hbaseConfiguration"/>
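Reading the row back is symmetrical; a minimal sketch, assuming HbaseTemplate offers a get(tableName, rowName, RowMapper) variant (table, row, and column names reuse the example above):

import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.springframework.data.hadoop.hbase.HbaseTemplate;
import org.springframework.data.hadoop.hbase.RowMapper;

// Read back the cell written above and map it to a String.
public String readValue(HbaseTemplate template) {
    return template.get("MyTable", "SomeRow", new RowMapper<String>() {
        @Override
        public String mapRow(Result result, int rowNum) throws Exception {
            return Bytes.toString(result.getValue(
                    Bytes.toBytes("SomeColumn"), Bytes.toBytes("SomeQualifier")));
        }
    });
}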
Example – Script Runner

<hdp:hive-server host="hivehost" port="10001"/>

<hdp:hive-template/>

<hdp:hive-client-factory host="some-host" port="some-port">
    <hdp:script location="classpath:org/company/hive/script.q">
        <arguments>ignore-case=true</arguments>
    </hdp:script>
</hdp:hive-client-factory>

<hdp:hive-runner id="hiveRunner" run-at-startup="true">
    <hdp:script>
        DROP TABLE IF EXISTS testHiveBatchTable;
        CREATE TABLE testHiveBatchTable (key int, value string);
    </hdp:script>
    <hdp:script location="hive-scripts/script.q"/>
</hdp:hive-runner>
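The <hdp:hive-template/> can also be called from Java; a minimal sketch, assuming HiveTemplate's query(String) method and the table created by the runner above (the class name is illustrative):

import java.util.List;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.hadoop.hive.HiveTemplate;

public class HiveQueries {

    @Autowired
    private HiveTemplate hiveTemplate;

    // Run a HiveQL query through the template; each returned String is a result row.
    public List<String> allRows() {
        return hiveTemplate.query("SELECT * FROM testHiveBatchTable;");
    }
}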
WORKFLOWS
Typical Big Data Processing Flow
Capture → Pre-Process → Insert → Process → Extract → Present
Spring Batch & Spring Integration
• Big Data Flows are based on Spring Integration & Spring Batch
• Spring for Hadoop provides:
  – Spring Batch tasklets
  – Spring Integration support
Tasklets
• Job runners
• Script runners
• Hive
• Pig
• Cascading
Example
<hdp:job-tasklet id="hadoop-tasklet" job-ref="mr-job" wait-for-completion="true" />
<batch:job id="job1">
    <batch:step id="import" next="ht">
        <batch:tasklet ref="script-tasklet"/>
    </batch:step>
    <batch:step id="ht">
        <batch:tasklet ref="hadoop-tasklet"/>
    </batch:step>
</batch:job>
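Launching job1 is then plain Spring Batch; a minimal sketch, assuming the context also defines the standard Batch infrastructure (job repository, launcher) and lives in a file named batch-context.xml (illustrative):

import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.support.ClassPathXmlApplicationContext;

public class BatchLauncher {
    public static void main(String[] args) throws Exception {
        // "batch-context.xml" is an illustrative name for the config above.
        ClassPathXmlApplicationContext ctx =
                new ClassPathXmlApplicationContext("batch-context.xml");
        JobLauncher launcher = ctx.getBean(JobLauncher.class);
        Job job = ctx.getBean("job1", Job.class);
        JobExecution execution = launcher.run(job, new JobParameters());
        System.out.println("Exit status: " + execution.getStatus());
        ctx.close();
    }
}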
SCRIPTING & MIGRATION
Details
• Supports JVM languages from JSR-223 (Groovy, JRuby, Jython, Rhino)
• Exposes SimplerFileSystem
• Provides implicit variables
• Exposes FsShell to mimic the HDFS shell (see the Java sketch below)
• Exposes DistCp to mimic distcp from Hadoop
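The fsh variable seen in scripts is an org.springframework.data.hadoop.fs.FsShell, and the same class can be injected and used from Java; a minimal sketch reusing the test/rmr/put calls from the Groovy example that follows (the class name is illustrative):

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.hadoop.fs.FsShell;
import org.springframework.stereotype.Component;

@Component
public class HdfsHousekeeping {

    // FsShell mirrors the "hadoop fs" commands (test, rmr, put, ...).
    @Autowired
    private FsShell fsh;

    public void reset(String inputPath, String localFile) {
        // Remove the directory if it exists, then upload the local file.
        if (fsh.test(inputPath)) {
            fsh.rmr(inputPath);
        }
        fsh.put(localFile, inputPath);
    }
}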
Example

<hdp:script-tasklet id="script-tasklet">
    <hdp:script language="groovy">
        inputPath = "/user/gutenberg/input/word/"
        outputPath = "/user/gutenberg/output/word/"
        if (fsh.test(inputPath)) {
            fsh.rmr(inputPath)
        }
        if (fsh.test(outputPath)) {
            fsh.rmr(outputPath)
        }
        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
        fsh.put(inputFile, inputPath)
    </hdp:script>
</hdp:script-tasklet>
Migration

Hadoop Streaming:

<hdp:streaming id="streaming"
    input-path="/input/" output-path="/output/"
    mapper="${path.cat}" reducer="${path.wc}"/>

Hadoop Tool Executor:

<hdp:tool-runner id="someTool" tool-class="org.foo.SomeTool" run-at-startup="true">
    <hdp:arg value="data/in.txt"/>
    <hdp:arg value="data/out.txt"/>
    property=value
</hdp:tool-runner>
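The tool-class is a standard org.apache.hadoop.util.Tool; a minimal sketch of what org.foo.SomeTool might look like (the body of run() is illustrative):

package org.foo;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;

// A Tool receives the Configuration prepared by <hdp:tool-runner>
// plus the <hdp:arg> values as its args array.
public class SomeTool extends Configured implements Tool {

    @Override
    public int run(String[] args) throws Exception {
        String input = args[0];   // "data/in.txt"
        String output = args[1];  // "data/out.txt"
        // Illustrative: set up and submit a job using getConf(),
        // then return 0 on success and non-zero on failure.
        System.out.println("Would process " + input + " into " + output);
        return 0;
    }
}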
Alternatives
• Apache Flume – distributed data collection
• Apache Oozie – workflow scheduler
• Apache Sqoop – SQL bulk import/export
TESTING & RELATED TOOLS
Testing
• JUnit/Mocks + MRUnit (see the sketch below)
• Mini-HDFS and Mini-MapReduce cluster
• LocalJobRunner
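A minimal MRUnit sketch against the word-count mapper shown earlier (WordCountMapper is the illustrative class from that sketch):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class WordCountMapperTest {

    @Test
    public void emitsOneCountPerWord() throws Exception {
        // MapDriver feeds one record through the mapper and verifies the output pairs.
        MapDriver.newMapDriver(new WordCountMapper())
                .withInput(new LongWritable(0), new Text("Dog ate the bone"))
                .withOutput(new Text("Dog"), new IntWritable(1))
                .withOutput(new Text("ate"), new IntWritable(1))
                .withOutput(new Text("the"), new IntWritable(1))
                .withOutput(new Text("bone"), new IntWritable(1))
                .runTest();
    }
}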
Spring YARN
Hadoop 1.x: HDFS (storage) + Map/Reduce (cluster management and data processing)
Hadoop 2.x: HDFS (storage) + YARN (cluster management) + Map/Reduce and other engines, like Spark, for data processing
Spring eXtreme Data (XD)
• Ultimate data processing solution
• Implements the most common approach; business logic is up to you
• On top of Spring Batch and Spring Integration
• Has a DSL
• Scalable
More speedups
• Use the provider's quick-start VM for initial development
• Use cloud-based images for production (start/stop)
• Don't use Map/Reduce without a real need; start with a higher abstraction
• Don't migrate without a real need!
• Invest in DevOps (Chef / Puppet / Vagrant…)
Q/A