“25th CSI Karnataka Student Convention”
Map/Reduce Algorithm Performance Analysis in Computing Frequency of
Tweets
Shravanthi U M & Nagashree N
Information Science and Engineering
Bangalore Institute of Technology, Bangalore
AGENDA
Data
Big Data
Twitter and Big Data
Classical Approach
Why the Hadoop Framework
Map/Reduce
Our Proposed Approach
Conclusion
Q & A
It's All About Data
STRUCTURED
UNSTRUCTURED
Big Data
Data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes in a single data set.
Examples: web logs, social network data, internet search indexes, etc.
Classical Approach
[Diagram: egrep _____ files[0-1000] — a single machine runs egrep over file0 … file1000 stored on a remote file system, pulling every file across the network before searching it]
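The classical approach above can be sketched as a single machine pulling every file across the network and searching it locally; a minimal Python sketch (the function name and toy file contents are ours, not from the slides):

```python
import re

# Illustrative sketch of the classical approach: one machine fetches
# every remote file and runs the pattern search sequentially.
def classical_grep(pattern, file_contents):
    """Scan each file's text, one after another, for lines matching pattern."""
    matches = []
    for contents in file_contents:      # each file crosses the network first
        for line in contents.splitlines():
            if re.search(pattern, line):
                matches.append(line)
    return matches

# Toy stand-ins for file0 .. file1000
files = ["error: disk full\nall good", "another error here"]
print(classical_grep("error", files))
```

The bottleneck is clear: every byte of every file travels to one node, and the search itself runs on a single CPU.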
Hadoop Framework
Fault tolerance
Streaming data access - HDFS emphasizes high throughput
Extreme scalability - HDFS will scale to petabytes; example: at Facebook
Portability - HDFS is portable across operating systems
Write once, read many times
Locality of computation - move the program near to the data
[Diagram: Move Computation to Data — for egrep _____ files[0-1000], HDFS schedules the egrep program on the nodes (40 nodes/rack) that already hold file0 … file1000, instead of moving the files to the program]
Map/Reduce
Map()
Input: any file (e.g. documents)
Output: stream of <key, value> pairs (e.g. <word, count> pairs)

Data Redistribution and Grouping

Reduce()
Input: all <key, value> pairs with the same key, grouped (e.g. all <word, count> pairs where word = "the")
Output: anything (e.g. the sum of counts for a specific word)
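The two phases above can be shown with the classic word-count example; a minimal in-memory sketch in pure Python (no Hadoop, and the function names are ours):

```python
from itertools import groupby

def map_fn(document):
    """Map(): emit a <word, 1> pair for every word in the input."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce(): combine all values grouped under one key (here: sum them)."""
    return (word, sum(counts))

def word_count(documents):
    # Map phase: a stream of <key, value> pairs from every input document
    pairs = [kv for doc in documents for kv in map_fn(doc)]
    # Data redistribution and grouping: sort so equal keys become adjacent
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: one reduce_fn call per distinct key
    return dict(reduce_fn(key, (v for _, v in group))
                for key, group in groupby(pairs, key=lambda kv: kv[0]))

print(word_count(["the cat sat", "the dog saw the cat"]))
```

On a real cluster the sort-and-group step is the shuffle, performed by the framework across machines; the map and reduce functions themselves look just like these.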
Advantages:
Fine-grained Map and Reduce tasks
◦ Improved load balancing
◦ Faster recovery from failed tasks
Automatic re-execution on failure
◦ In a large cluster, some nodes are always slow or flaky
◦ The framework re-executes failed tasks
Locality optimizations
◦ Map/Reduce queries HDFS for the locations of the input data
◦ When possible, map tasks are scheduled close to their inputs (local access, local rack access, remote rack access)
What did we do…
Python code to extract tweets using the "twitter.Search" API:

import urllib
import re

tweets = []
f = open("tweets.txt", "w")   # output file name is illustrative
query = "AnnaHazare"
for page in range(10):
    turl = urllib.urlopen("http://search.twitter.com/search.atom?lang=en&q=" + query + "&rpp=100&page=" + str(page))
    # pull the <updated> timestamp out of every tweet entry on the page
    tweettext = re.findall('<updated>(.*?)</updated>', turl.read())
    print "Got the Page No. ", (page + 1)
    for t in tweettext:        # renamed loop variable: the original reused i
        tweets.append(t)
        f.write(t + "\n")
f.close()
Extracted DATA
Map/Reduce Implementation
Map() output, before grouping:
<6/4/11, 1> <6/4/11, 1> <6/4/11, 1> <6/6/11, 1> <6/6/11, 1> <6/6/11, 1> <15/8/11, 1> <15/8/11, 1> …

Grouped pairs, one Reduce() call per key:
Reduce() ← <6/4/11, 1> <6/4/11, 1> <6/4/11, 1> <6/4/11, 1> <6/4/11, 1> …
Reduce() ← <6/6/11, 1> <6/6/11, 1> <6/6/11, 1> …
Reduce() ← <15/8/11, 1> <15/8/11, 1> <15/8/11, 1> …

Server 1 Final Result File:
6/4/11   85
6/6/11   36
15/8/11  125
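The date-frequency job above can be sketched as a mapper and reducer pair; a minimal in-memory sketch (the ISO timestamp format is an assumption about the extracted <updated> fields, which are ISO-8601 in the Atom feed):

```python
from itertools import groupby

def date_mapper(lines):
    """Map(): emit <date, 1> for every extracted tweet timestamp line."""
    for line in lines:
        date = line.strip().split("T")[0]   # assumes e.g. 2011-04-06T10:00:00Z
        if date:
            yield (date, 1)

def date_reducer(pairs):
    """Reduce(): sum the 1s for each date after grouping by key."""
    for date, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (date, sum(v for _, v in group))

sample = ["2011-04-06T10:00:00Z", "2011-04-06T11:30:00Z", "2011-06-06T09:00:00Z"]
counts = dict(date_reducer(date_mapper(sample)))
print(counts)
```

Under Hadoop these would run as separate tasks, with the framework handling the sort/group between them and each reducer writing its dates to the final result file.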
What's UNIQUE…
Business Analytics - a promising approach to spot the popularity of a "New Product"
Sentiment Analysis
Thank You!