“25th CSI Karnataka Student Convention”
Map/Reduce Algorithm Performance Analysis in Computing Frequency of
Tweets
Shravanthi U M & Nagashree N
Information Science and Engineering
Bangalore Institute of Technology, Bangalore
AGENDA
Data
Big Data
Twitter and Big Data
Classical Approach
Why the Hadoop Framework
Map/Reduce
Our Proposed Approach
Conclusion
Q & A
It's All About Data
STRUCTURED
UNSTRUCTURED
Big Data
Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.
Examples: web logs, social network data, internet search indexes, etc.
Classical Approach
[Diagram: egrep _____ files[0-1000] — a single machine runs egrep over file0 … file1000 stored on a remote file system, pulling every file across the network before searching it]
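The classical approach above can be sketched as a single machine pulling every file across the network and searching it locally; a minimal Python sketch (the function name and toy file contents are ours, not from the slides):

```python
import re

# Illustrative sketch of the classical approach: one machine fetches
# every remote file and runs the pattern search sequentially.
def classical_grep(pattern, file_contents):
    """Scan each file's text, one after another, for lines matching pattern."""
    matches = []
    for contents in file_contents:      # each file crosses the network first
        for line in contents.splitlines():
            if re.search(pattern, line):
                matches.append(line)
    return matches

# Toy stand-ins for file0 .. file1000
files = ["error: disk full\nall good", "another error here"]
print(classical_grep("error", files))
```

The bottleneck is clear: every byte of every file travels to one node, and the search itself runs on a single CPU.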
Hadoop Framework
Fault tolerance
Streaming data access - HDFS emphasizes high throughput
Extreme scalability - HDFS will scale to petabytes; example: at Facebook
Portability - HDFS is portable across operating systems
Write once, read many times
Locality of computation - move the program near to the data
[Diagram: Move Computation to Data — for egrep _____ files[0-1000], HDFS schedules the egrep program on the nodes (40 nodes/rack) that already hold file0 … file1000, instead of moving the files to the program]
Map/Reduce
Map()
Input: any file (e.g. documents)
Output: stream of <key, value> pairs (e.g. <word, count> pairs)

Data Redistribution and Grouping

Reduce()
Input: all <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = "the")
Output: anything (e.g. the sum of counts for a specific word)
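The two phases above can be shown with the classic word-count example; a minimal in-memory sketch in pure Python (no Hadoop, and the function names are ours):

```python
from itertools import groupby

def map_fn(document):
    """Map(): emit a <word, 1> pair for every word in the input."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce(): combine all values grouped under one key (here: sum them)."""
    return (word, sum(counts))

def word_count(documents):
    # Map phase: a stream of <key, value> pairs from every input document
    pairs = [kv for doc in documents for kv in map_fn(doc)]
    # Data redistribution and grouping: sort so equal keys become adjacent
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: one reduce_fn call per distinct key
    return dict(reduce_fn(key, (v for _, v in group))
                for key, group in groupby(pairs, key=lambda kv: kv[0]))

print(word_count(["the cat sat", "the dog saw the cat"]))
```

On a real cluster the sort-and-group step is the shuffle, performed by the framework across machines; the map and reduce functions themselves look just like these.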
Advantages:
Fine-grained Map and Reduce tasks
◦ Improved load balancing
◦ Faster recovery from failed tasks
Automatic re-execution on failure
◦ In a large cluster, some nodes are always slow or flaky
◦ The framework re-executes failed tasks
Locality optimizations
◦ Map/Reduce queries HDFS for the locations of the input data
◦ When possible, map tasks are scheduled close to their inputs (local access, local rack access, remote rack access)
What did we do…
Python code to extract tweets using the "twitter.Search" API:

import urllib
import re

tweets = []
f = open("tweets.txt", "w")   # output file name is illustrative
query = "AnnaHazare"
for page in range(10):
    turl = urllib.urlopen("http://search.twitter.com/search.atom?lang=en&q=" + query + "&rpp=100&page=" + str(page))
    # pull the <updated> timestamp out of every tweet entry on the page
    tweettext = re.findall('<updated>(.*?)</updated>', turl.read())
    print "Got the Page No. ", (page + 1)
    for t in tweettext:        # renamed loop variable: the original reused i
        tweets.append(t)
        f.write(t + "\n")
f.close()
Extracted DATA
Map/Reduce Implementation
Map() output, before grouping:
<6/4/11, 1> <6/4/11, 1> <6/4/11, 1> <6/6/11, 1> <6/6/11, 1> <6/6/11, 1> <15/8/11, 1> <15/8/11, 1> …

Grouped pairs, one Reduce() call per key:
Reduce() ← <6/4/11, 1> <6/4/11, 1> <6/4/11, 1> <6/4/11, 1> <6/4/11, 1> …
Reduce() ← <6/6/11, 1> <6/6/11, 1> <6/6/11, 1> …
Reduce() ← <15/8/11, 1> <15/8/11, 1> <15/8/11, 1> …

Server 1 Final Result File:
6/4/11   85
6/6/11   36
15/8/11  125
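The date-frequency job above can be sketched as a mapper and reducer pair; a minimal in-memory sketch (the ISO timestamp format is an assumption about the extracted <updated> fields, which are ISO-8601 in the Atom feed):

```python
from itertools import groupby

def date_mapper(lines):
    """Map(): emit <date, 1> for every extracted tweet timestamp line."""
    for line in lines:
        date = line.strip().split("T")[0]   # assumes e.g. 2011-04-06T10:00:00Z
        if date:
            yield (date, 1)

def date_reducer(pairs):
    """Reduce(): sum the 1s for each date after grouping by key."""
    for date, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (date, sum(v for _, v in group))

sample = ["2011-04-06T10:00:00Z", "2011-04-06T11:30:00Z", "2011-06-06T09:00:00Z"]
counts = dict(date_reducer(date_mapper(sample)))
print(counts)
```

Under Hadoop these would run as separate tasks, with the framework handling the sort/group between them and each reducer writing its dates to the final result file.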
What's UNIQUE…
Business Analytics - a promising approach to spot the popularity of a "New Product"
Sentiment Analysis
Thank You!