To Infinity and Beyond - OSDConf2014


Description

The story of how solving one problem the OpenSource way opened doors to so much more. Talk presented by Pranav Prakash and Hari Prasanna at OSDConf 2014, New Delhi.

Transcript of To Infinity and Beyond - OSDConf2014

TO INFINITY AND BEYOND

Pranav Prakash

in.linkedin.com/in/prakashpranav

Search @LinkedIn

Hari Prasanna

in.linkedin.com/in/mostlycached

BigData @LinkedIn

The story of how solving one problem the OpenSource way opened doors to so much more

OpenSource Chain Reaction

How “it” begins

How “it” grows

How “it” contributes

LUCENE

Information Retrieval Library

Started in 1999 as a SourceForge.net project

Joined Apache in 2001 as part of the Jakarta family

Became a Top-Level Project in 2005

Used by LinkedIn, Twitter, Comcast

LUCENE

IR requirements

What would you do next?

Be better at searching

Crawl the web

Web Wrapper around Lucene

Full-Text Search, NRT (near-real-time) Indexing

Faceted Search, Clustering
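To make the “IR library” idea concrete, a minimal sketch of indexing and searching with Lucene’s core API, assuming a Lucene 4.x-era classpath; the field name and document contents are hypothetical:

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneHello {
  public static void main(String[] args) throws Exception {
    StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_47);
    Directory index = new RAMDirectory(); // in-memory index for the demo

    // Index one hypothetical document with a full-text "title" field
    IndexWriter writer = new IndexWriter(index,
        new IndexWriterConfig(Version.LUCENE_47, analyzer));
    Document doc = new Document();
    doc.add(new TextField("title", "Lucene in Action", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();

    // Parse a free-text query and search the index
    Query q = new QueryParser(Version.LUCENE_47, "title", analyzer).parse("lucene");
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(index));
    ScoreDoc[] hits = searcher.search(q, 10).scoreDocs;
    for (ScoreDoc hit : hits) {
      System.out.println(searcher.doc(hit.doc).get("title"));
    }
  }
}

Nutch, described next, is essentially this loop scaled out: fetch pages, parse them, and feed them into a Lucene index.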

NUTCH

Web Crawler

Billions of pages on the internet

An alternative to commercial search engines

From a single tool to an ecosystem

• Breaking away from the initial problem statement

• The Google factor - GFS (2003), BigTable (2006), Pregel (2009) leading to HDFS, HBase and Giraph

• The thrill and chaos of working with alpha software - from dealing with compatibility issues to being a part of active development

• Interoperability between various systems

• Ever-widening scope of the project and leveraging other tools in the ecosystem

Ecosystem

• Features:

• Distributed storage - HDFS

• Distributed processing - MapReduce

• Fault tolerance

• Horizontal scalability

• Comparisons

• RDBMS

• Grid computing

• Use Cases

• Analytics (trends, predictions, summaries, etc.)

• Searching and Indexing

Hadoop
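To make the Hadoop bullets concrete, a minimal sketch of writing to HDFS through the FileSystem Java API; the path is hypothetical, and fs.defaultFS is assumed to point at your cluster’s namenode:

import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    // Picks up fs.defaultFS from core-site.xml (e.g. hdfs://namenode:8020)
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path path = new Path("/tmp/hello.txt"); // hypothetical path
    try (OutputStream out = fs.create(path, true)) { // overwrite if present
      out.write("hello hdfs".getBytes("UTF-8"));
    }

    // Blocks are replicated across datanodes for fault tolerance (default: 3)
    System.out.println("replication: " + fs.getFileStatus(path).getReplication());
  }
}

The same FileSystem abstraction is what MapReduce jobs read from and write to, which is why storage and processing scale out together.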

• Features:

• Column-oriented storage

• Horizontal scalability

• Low latency reads

• MapReduce support

• SQL Support with Phoenix

• Coprocessors and secondary indexes

• RDBMS vs HBase

• Use cases

• Facebook messages

• Monitoring with OpenTSDB

HBase
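A minimal sketch of the low-latency read/write path, using the 0.94/0.98-era HBase client API; the table name, column family and values are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    HTable table = new HTable(conf, "users"); // hypothetical table, family "info"

    // Write one cell: row key -> column family:qualifier -> value
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
    table.put(put);

    // Random read by row key - the low-latency access pattern
    Result result = table.get(new Get(Bytes.toBytes("row1")));
    System.out.println(Bytes.toString(
        result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
    table.close();
  }
}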

Vanilla MapReduce

Higher Abstractions

• Pig - data flow language

• Hive - SQL-to-MapReduce adapter (see the JDBC sketch after this list)

• Cascading - Pipeline primitives and other powerful abstractions

• Even higher abstractions with Cascalog (Cascading + Prolog), PigPen (Clojure for Pig) and Pig libraries like DataFu
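As an example of how far these abstractions go, the maximum-temperature question answered by the MapReduce code below collapses into one line of SQL when run through Hive. A sketch against the HiveServer2 JDBC driver; the weather table and connection URL are hypothetical:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveMaxTemperature {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    // Hive compiles this SQL into one or more MapReduce jobs behind the scenes
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT year, MAX(temperature) FROM weather GROUP BY year")) {
      while (rs.next()) {
        System.out.println(rs.getString(1) + "\t" + rs.getInt(2));
      }
    }
  }
}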

Java MapReduce

Having run through how the MapReduce program works, the next step is to express it in code. We need three things: a map function, a reduce function, and some code to run the job. The map function is represented by the Mapper class, which declares an abstract map() method. Example 2-3 shows the implementation of our map method.

Example 2-3. Mapper for maximum temperature example

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

The Mapper class is a generic type, with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. For the present example, the input key is a long integer offset, the input value is a line of text, the output key is a year, and the output value is an air temperature (an integer).

Figure 2-1. MapReduce logical data flow

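The excerpt stops at the mapper. For completeness, a matching reducer, a sketch in the spirit of the book’s companion example, simply keeps the maximum temperature seen for each year:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    // The framework groups all temperatures for one year; keep the maximum
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}

A small driver class wiring the mapper and reducer into a Job supplies the third piece the excerpt mentions: the code to run the job.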

Data Processing

• Data collection, aggregation and forwarding with Kafka, Flume and Scribe (see the producer sketch after this list)

• Real-time stream processing with Storm, enabling online machine learning and real-time analytics at Twitter and Groupon

• Graph processing of a trillion edges at Facebook with Apache Giraph
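A minimal sketch of the collection-and-forwarding step, using the Kafka 0.8-era producer API; the broker address, topic and message are hypothetical:

import java.util.Properties;

import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class KafkaForwarder {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("metadata.broker.list", "localhost:9092"); // hypothetical broker
    props.put("serializer.class", "kafka.serializer.StringEncoder");

    Producer<String, String> producer =
        new Producer<String, String>(new ProducerConfig(props));

    // Append one event to the "page-views" topic; downstream consumers
    // (Storm topologies, HDFS loaders, ...) read it at their own pace
    producer.send(new KeyedMessage<String, String>("page-views", "user42\t/home"));
    producer.close();
  }
}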

• Quickstarting with the Cloudera distribution

• Getting one step through the door - SlideShare’s journey

• Can your app survive without it? - Raising your bar

• Programmer, Administrator, DBA, Data Scientist - what hat are you wearing today?

• The road ahead

• Keeping track of the developments and giving back

Leveraging “Big Data”

• Scientific Research - SciHadoop, decoding DNA

• Finance - Fraud Detection, Algorithmic trading, Risk Management

• Web - Network Analysis, Recommendation Engines, Personalization

• Government - Election campaigns, intelligence systems

• Supply chain optimization, Weather forecasting

In the Wild