To Infinity and Beyond - OSDConf2014
TO INFINITY AND BEYOND
Pranav Prakash
in.linkedin.com/in/prakashpranav
Search @LinkedIn
Hari Prasanna
in.linkedin.com/in/mostlycached
BigData @LinkedIn
The story of how solving one problem the OpenSource way opened doors to so much more
LUCENE
Information Retrieval Library
Started in 1999 as SourceForge.net project
Joins Apache in 2001 in Jakarta’s family
Top Level Project in 2005
LinkedIn, Twitter, Comcast
From a single tool to an ecosystem
• Breaking away from the initial problem statement
• The Google factor - GFS (2003), BigTable (2006), Pregel (2009) leading to HDFS, HBase and Giraph
• The thrill and chaos of working with alpha software - from dealing with compatibility issues to being a part of active development
• Interoperability between various systems
• Ever widening scope of the project and leveraging other tools in the ecosystem
• Features:
  • Distributed storage - HDFS
  • Distributed processing - MapReduce
  • Fault tolerance
  • Horizontal scalability
• Comparisons
  • RDBMS
  • Grid computing
• Use Cases
  • Analytics (trends, predictions, summaries etc.)
  • Searching and Indexing
Hadoop
• Features:
  • Column-oriented storage (column families)
  • Horizontal scalability
  • Low latency reads
  • MapReduce support
  • SQL support with Phoenix
  • Coprocessors and secondary indexes
• RDBMS vs HBase
• Use cases
  • Facebook Messages
  • Monitoring with OpenTSDB
HBase
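The column-based storage and low latency reads listed above can be pictured as a sorted map of row keys, each holding a sorted map of columns. The following is a minimal plain-Java sketch of that idea only: it is not the HBase client API, it ignores cell versions and timestamps, and the class name `ColumnStoreSketch`, the row-key layout and the column names are all hypothetical.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Toy model of a column-family table: NOT the HBase API, just nested sorted maps.
class ColumnStoreSketch {
    // row key -> ("family:qualifier" -> value); both levels are kept sorted.
    private final NavigableMap<String, NavigableMap<String, String>> rows = new TreeMap<>();

    // Store one cell; rows and columns are created on first write (no fixed schema per row).
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(family + ":" + qualifier, value);
    }

    // Point read by row key and column; returns null when the cell does not exist.
    String get(String rowKey, String family, String qualifier) {
        NavigableMap<String, String> columns = rows.get(rowKey);
        return columns == null ? null : columns.get(family + ":" + qualifier);
    }

    // Range scan over the sorted row keys: startRow inclusive, stopRow exclusive.
    NavigableMap<String, NavigableMap<String, String>> scan(String startRow, String stopRow) {
        return rows.subMap(startRow, true, stopRow, false);
    }

    public static void main(String[] args) {
        ColumnStoreSketch table = new ColumnStoreSketch();
        // Hypothetical row-key design "user#message" so one user's messages sort together.
        table.put("user1#msg001", "meta", "from", "alice");
        table.put("user1#msg001", "body", "text", "hello");
        table.put("user1#msg002", "meta", "from", "bob");
        table.put("user2#msg001", "meta", "from", "carol");

        System.out.println(table.get("user1#msg001", "meta", "from")); // prints alice
        System.out.println(table.scan("user1#", "user1$").keySet());   // prints [user1#msg001, user1#msg002]
    }
}
```

Because row keys are sorted, choosing a key prefix that groups related rows turns "all messages for one user" into a contiguous range scan rather than a full-table search.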
Vanilla MapReduce
Higher Abstractions
• Pig - data flow language
• Hive - SQL to MapReduce adapter
• Cascading - Pipeline primitives and other powerful abstractions
• Even higher abstractions with Cascalog (Cascading + Datalog-style queries in Clojure), PigPen (Clojure for Pig) and Pig libraries like DataFu
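To give a feel for what a higher abstraction buys, here is a small sketch of a group-and-count expressed declaratively. Java streams are used purely as a stand-in: this is neither Pig Latin nor HiveQL, nothing here runs on a cluster, and the class and method names are hypothetical. The point is only that the programmer states the grouping and the aggregate, and the framework decides how to execute them.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Stand-in for a declarative query layer; illustrative only, not Pig or Hive.
class DeclarativeGroupBySketch {
    // Roughly the intent of: SELECT word, COUNT(*) FROM words GROUP BY word
    static Map<String, Long> countByWord(List<String> words) {
        return words.stream()
                .collect(Collectors.groupingBy(
                        word -> word,            // grouping key (the role the map output key plays)
                        TreeMap::new,            // keep keys sorted for readable output
                        Collectors.counting())); // aggregate per key (the role the reducer plays)
    }

    public static void main(String[] args) {
        List<String> words = List.of("hadoop", "hbase", "hadoop", "pig");
        System.out.println(countByWord(words)); // prints {hadoop=2, hbase=1, pig=1}
    }
}
```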
Java MapReduce
Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method. Example 2-3 shows the implementation of our map method.
Example 2-3. Mapper for maximum temperature example
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // Sentinel value used in the input records for a missing temperature reading
  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Emit (year, temperature) only for present readings with an accepted quality code
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text,
Figure 2-1. MapReduce logical data flow
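The three pieces named in the excerpt (a map function, a reduce function, and code to run the job) can be traced end to end without a cluster. The sketch below is a plain-Java simulation of the logical flow, not Hadoop code: no Hadoop classes are used, the grouping step stands in for the framework's shuffle, and the simplified "year,temperature,quality" record format and all class and method names are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of map -> shuffle -> reduce; illustrative only, no Hadoop involved.
class MaxTemperatureFlowSketch {
    static final int MISSING = 9999;

    // Map step: turn one record into a (year, temperature) pair, or null if it is filtered out.
    // Assumed record format (hypothetical): "year,temperature,quality".
    static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(",");
        int temperature = Integer.parseInt(fields[1]);
        if (temperature != MISSING && fields[2].matches("[01459]")) {
            return Map.entry(fields[0], temperature);
        }
        return null;
    }

    // Shuffle step: group all values by key, with keys in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: collapse the values for one key to their maximum.
    static int reduce(List<Integer> temperatures) {
        int max = Integer.MIN_VALUE;
        for (int temperature : temperatures) {
            max = Math.max(max, temperature);
        }
        return max;
    }

    // "Driver": wire the three steps together for a list of input records.
    static Map<String, Integer> run(List<String> records) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String record : records) {
            Map.Entry<String, Integer> pair = map(record);
            if (pair != null) {
                pairs.add(pair);
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(pairs).entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = List.of(
                "1950,0,1", "1950,22,1", "1950,-11,1",
                "1949,111,1", "1949,78,1", "1949,9999,1");
        System.out.println(run(records)); // prints {1949=111, 1950=22}
    }
}
```

In a real job the framework performs the grouping between map and reduce and runs many map and reduce tasks in parallel; only the two functions and the job setup are written by hand.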
Data Processing
• Data collection, aggregation and forwarding with Kafka, Flume, Scribe.
• Real-time stream processing with Storm to enable online machine learning and real-time analytics at Twitter and Groupon.
• Graph processing of a trillion edges at Facebook with Apache Giraph
• Quickstarting with the Cloudera distribution
• Getting one step through the door - SlideShare’s journey
• Can your app survive without it? - Raising your bar
• Programmer, Administrator, DBA, Data Scientist - what hat are you wearing today?
• The road ahead
• Keeping track of the developments and giving back
Leveraging “Big Data”
• Scientific Research - SciHadoop, decoding DNA
• Finance - Fraud Detection, Algorithmic trading, Risk Management
• Web - Network Analysis, Recommendation Engines, Personalization
• Government - Election campaigns, intelligence systems
• Supply chain optimization, Weather forecasting
In the Wild