Partnership for International Research and Education

1
W hy the need to cope w ith heterogeneousdatasets? The availability oflinked open data (LOD)and otherheterogeneous data allow susto dynam ically add functionality to ourapplications OurGPS navigation system m aynothave a listofall M cDonalds restaurantsbut, ifw e tell itw here to find a the inform ation then a m ethod such asIBM ’sM idascan help usextractthe data and w e can instantly provide new functionality. M apReduce allow susto focuslesson efficiency and m ore on dream ing up new contentavenuesw e can open Once w e see a consistentneed forthe functionality w e can proceed to develop an efficientnon-m apReduce solution. M apReduce allow susto provide the functionalityrightaw ayw ithout significanteffortand developm ent Unsorted Spatial Data Exam ple Given a pointα, a setofkeyw ordsε, and a collection ofdatasetsβ Return a listofobjectsin β thatcontain som e orall ε i aggregated by “distance range”. Given α= (80.98, -127.356) ε = {M cDonalds} Result: Listof All M cDonaldsbetw een 0-5 m ilesofα All M cDonaldsbetw een 6-10 m ilesofα All M cDonaldsbetw een 11-20 m ilesofα All M cDonalds21 m ilesorm ore aw ayfrom α Paradigm can be tuned quite easily form anydifferentdomains! M apReduceImplementation Inputto M apper<β i i,j > sim ilarityM easure = δ(α, β i,j ) κ = w hichBucket(sim ilarityM easure) outputInterm ittentKeyValue(κ, β i,j ) Interm ediate Output<key κ, Objectβ i,j > Com binerwill com bine these bykey Inputinto Reducer<key κ, < γ> ObjectIterator> ReducerOutput<key κ, γ[]Object> In the case ofthe described spatial application the resultis sim ple a concatenated file ofall ofthe objectsin the ObjectIterator The resultcan be in anyform atw e specify; array, iterator, string, etc. Overview Releases Releases Speeches Speeches Advisories Advisories Transcripts Transcripts HDFS HTM LDocs Custom Java Crawler DoD W ebsite M apReduce Paradigm Each file isprocessed asa record <k1, v1> <NullW ritable, BytesW ritable fileContents> Who , What, When isextracted and em itted <k2, v2> – <Text Who , Text When > – <Text When , Text What> – <Text When , Text Who > •ThreePossibilitiesforReduction: Reduceraggregatesbased on Who and returnsa listof Whens • <Text Who , BytesW ritable Whens > Reduceraggregatesbased on When and returnsa listof Whats • <Text When , BytesW ritable Whats > Reduceraggregatesbased on When and returnsa listof Whos • <Text When , BytesW ritable Whos > Data • Approximately 4,000 HTM Lfilesin each ofthe 4 categories. New filesare added daily Size rangesfrom 34k to 500k Data w asduplicated to reach Hadoop file quantity lim it. M apperw ill processentire file asrecord Attem ptw ill be m ade to adaptcode to read one line ata tim e asw ell and outputaggregate counts. i.e. How m anyspeechesw ere m ade on July 4, 1996? Result A searchable data structure containing inform ation aboutall DoD publicationssince 1994. Query frontend can easily be developed Inform ation discovery and data m ining possibilitiesw ill exist. A DoD PublicationsArchive Search Tool Using M apReduce Sam ple Application Find all speechesand transcriptsofPresident Bush in 1995 Find any pressadvisoriesaboutSaudi Arabia thatw ere issued atm ost5 daysbefore a speech given by Donald Rum sfeld that m entionsSaudi Arabia. ObjectSimilarity M apReduceTem plate Given: An objectα oftype γ A collection β ofn datasetsβ 1 ,… , β n Each β i containsa listofj objectsβ i,1 to β i,j oftype γ A sim ilarity function δ(o 1 , o 2 )thatdeterm inesthe sim ilarity betw een o 1 and o 2 oftype γ Find the level ofsim ilarity betw een α and every objectin β using δ. Outputthe resultsorganized by key based on, for exam ple, level ofsim ilarity. i.e. {0-30% Sim ilar, γ[]Objects},{31-60% Sim ilar, γ[]Objects},{etc} Som e Possible Applications Social Netw ork Profiles Given a profile α and itscorresponding friend listF α Return all friendsthathave an 80% sim ilarity based on m usical taste. Spatial Objects Given a search pointα and an unsorted datasetofall M cDonalds restaurants Return all M cDonaldsw ithin 6-10 m ilesofα thathave a “Play Place” Plagiarism Detection Given a docum entα and a corpusofdocum entsβ Find ifthere isa m em berβ i ofβ w ith a level ofsim ilarityto α greaterthan 50% Traditional Im plementation for(inti = 0; i < n; i++) Read datasetβ i for(intj = 0; j < β i .num Records; j++) Read objectβ i,j sim ilarityM easure = δ(α, β i,j ) Place β i,j in appropriate “bucket”based on sim ilarityM easure Sequential Bottlenecksto one objectcom parison at a tim e Data isin one central location Requiresa w rite to file each tim e an additional resultisobtained Partnership for International Research and Education Partnership for International Research and Education A Global Living Laboratory for Cyberinfrastructure Application A Global Living Laboratory for Cyberinfrastructure Application Enablement Enablement Project Title: Computing Object Similarity Using MapReduce Student: Lester Melendez, PhD Student, Florida International University FIU Advisor: Dr. Naphtali Rishe, Professor, Florida International Unviersity PIRE International Partner Advisor: Dr. Rosa Badia, Barcelona Supercomputing Center Industrial Lab Advisor: Howard Ho, IBM Almaden Research Center The material presented in this poster is based upon the work supported by the National Science Foundation under Grant No. OISE- 0730065, IIS-0837716, CNS-0821345, HRD-0833093, and IIP-0829576. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. I. Research Overview and I. Research Overview and Outcome Outcome III. III. Acknowledgement Acknowledgement Latin Am erican G rid L A G rid National Science Foundation II. PIRE Experiences II. PIRE Experiences Received guidance from leading researchers on: IBM’s systemT, JAQL, and Midas MapReduce using Hadoop JSON, DB2, and more! Immersed myself in Bay Area and Catalonian culture. Used the knowledge gained to propose a data driven outcome prediction system. Extracted the US government organizational chart from PDF files using systemT, JAQL, JSON, and Hadoop! Gained intimate knowledge of US government agencies encouraging my pursuit of civil service careers. Made friends from Silicon Valley, China, Spain, Italy, and more!

description

National Science. Foundation. Partnership for International Research and Education A Global Living Laboratory for Cyberinfrastructure Application Enablement. Project Title: Computing Object Similarity Using MapReduce Student: Lester Melendez, PhD Student, Florida International University - PowerPoint PPT Presentation

Transcript of Partnership for International Research and Education

Page 1: Partnership for International Research and Education

Why the need to cope with heterogeneous datasets?

The availability of linked open data (LOD) and other heterogeneous data allows us to dynamically add functionality to our applications

Our GPS navigation system may not have a list of all McDonalds restaurants but, if we tell it where to find a the information then a method such as IBM’s Midas can help us extract the data and we can instantly provide new functionality.

MapReduce allows us to focus less on efficiency and more on dreaming up new content avenues we can open

Once we see a consistent need for the functionality we can proceed to develop an efficient non-mapReduce solution.

MapReduce allows us to provide the functionality right away without significant effort and development

Unsorted Spatial Data Example Given a point α, a set of keywords ε, and a collection of datasets β Return a list of objects in β that contain some or all εi aggregated by

“distance range”. Given

α= (80.98, -127.356) ε = {McDonalds}

Result: List of

All McDonalds between 0-5 miles of α All McDonalds between 6-10 miles of α All McDonalds between 11-20 miles of α All McDonalds 21 miles or more away from α

Paradigm can be tuned quite easily for many different domains!

MapReduce Implementation

Input to Mapper <βi, βi,j> similarityMeasure = δ(α, βi,j ) κ = whichBucket(similarityMeasure) outputIntermittentKeyValue(κ, βi,j )

Intermediate Output <key κ, Object βi,j> Combiner will combine these by key Input into Reducer <key κ, < γ > ObjectIterator> Reducer Output <key κ, γ [] Object>

In the case of the described spatial application the result is simple a concatenated file of all of the objects in the ObjectIterator

The result can be in any format we specify; array, iterator, string, etc.

Overview

ReleasesReleases SpeechesSpeechesAdvisoriesAdvisories TranscriptsTranscripts

HDFS

HTML Docs

Custom Java Crawler

DoD Website

MapReduce Paradigm

• Each file is processed as a record <k1, v1>– <NullWritable, BytesWritable fileContents>

• Who, What, When is extracted and emitted <k2, v2>– <Text Who, Text When>– <Text When, Text What>– <Text When, Text Who>

• Three Possibilities for Reduction:– Reducer aggregates based on Who and returns a list of Whens

• <Text Who, BytesWritable Whens>

– Reducer aggregates based on When and returns a list of Whats• <Text When, BytesWritable Whats>

– Reducer aggregates based on When and returns a list of Whos• <Text When, BytesWritable Whos>

Data

• Approximately 4,000 HTML files in each of the 4 categories. – New files are added daily

• Size ranges from 34k to 500k• Data was duplicated to reach Hadoop file quantity

limit.• Mapper will process entire file as record

– Attempt will be made to adapt code to read one line at a time as well and output aggregate counts.

• i.e. How many speeches were made on July 4, 1996?

Result

• A searchable data structure containing information about all DoD publications since 1994.

• Query front end can easily be developed• Information discovery and data mining

possibilities will exist.

A DoD Publications Archive Search Tool Using MapReduce

Sample Application

• Find all speeches and transcripts of President Bush in 1995

• Find any press advisories about Saudi Arabia that were issued at most 5 days before a speech given by Donald Rumsfeld that mentions Saudi Arabia.

Object Similarity MapReduce Template Given:

An object α of type γ A collection β of n datasets β1,…, βn

Each βi contains a list of j objects βi,1 to βi,j of type γ A similarity function δ(o1 , o2) that determines the similarity

between o1 and o2 of type γ Find the level of similarity between α and every object in

β using δ. Output the results organized by key based on, for

example, level of similarity. i.e. {0-30% Similar, γ[]Objects},{31-60% Similar, γ[]Objects},{etc}

Some Possible Applications

Social Network Profiles Given a profile α and its corresponding friend list Fα Return all friends that have an 80% similarity based on musical

taste. Spatial Objects

Given a search point α and an unsorted dataset of all McDonalds restaurants

Return all McDonalds within 6-10 miles of α that have a “Play Place”

Plagiarism Detection Given a document α and a corpus of documents β Find if there is a member βi of β with a level of similarity to α

greater than 50%

Traditional Implementation for(int i = 0; i < n; i++)

Read dataset βi for(int j = 0; j < βi.numRecords; j++)

Read object βi,j similarityMeasure = δ(α, βi,j ) Place βi,j in appropriate “bucket” based on similarityMeasure

Sequential Bottlenecks to one object comparison at

a time Data is in one central location Requires a write to file each time an

additional result is obtained

Partnership for International Research and EducationPartnership for International Research and EducationA Global Living Laboratory for Cyberinfrastructure Application EnablementA Global Living Laboratory for Cyberinfrastructure Application Enablement

Project Title: Computing Object Similarity Using MapReduceStudent: Lester Melendez, PhD Student, Florida International University

FIU Advisor: Dr. Naphtali Rishe, Professor, Florida International UnviersityPIRE International Partner Advisor: Dr. Rosa Badia, Barcelona Supercomputing Center

Industrial Lab Advisor: Howard Ho, IBM Almaden Research Center

The material presented in this poster is based upon the work supported by the National Science Foundation under Grant No. OISE-0730065, IIS-0837716, CNS-0821345, HRD-0833093, and IIP-0829576. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not

necessarily reflect the views of the National Science Foundation.

I. Research Overview and OutcomeI. Research Overview and Outcome

III. AcknowledgementIII. Acknowledgement

Latin American GridLAGrid

National ScienceFoundation

II. PIRE ExperiencesII. PIRE Experiences

Received guidance from leading researchers on:• IBM’s systemT, JAQL, and Midas• MapReduce using Hadoop• JSON, DB2, and more!

Immersed myself in Bay Area and Catalonian culture.

Used the knowledge gained to propose a data driven outcome prediction system.

Extracted the US government organizational chart from PDF files using systemT, JAQL, JSON, and Hadoop!

Gained intimate knowledge of US government agencies encouraging my pursuit of civil service careers.

Made friends from Silicon Valley, China, Spain, Italy, and more!