1CAMSAP
Transcript of 1CAMSAP
-
8/13/2019 1CAMSAP
1/17
Hadoop for Dummies: MapReduce to the Rescue
Introduction
Knowing why MapReduce is essential
Understanding how MapReduce works
Looking at the industries that use MapReduce
Considering real-world applications
Solution to Huge Unstructured Data
MapReduce
A software framework that breaks big problems into small, manageable tasks and then distributes them to multiple servers.
These servers are called nodes, and they work together in parallel to arrive at a result.
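As a rough illustration of the idea on this slide, the sketch below splits one large job (summing a list of numbers) into small tasks and hands them to a pool of workers. The worker threads only stand in for nodes, and the names `split_into_tasks` and `run_job` are hypothetical; this is not how Hadoop itself schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_tasks(data, task_size):
    """Break one big input into small, manageable chunks."""
    return [data[i:i + task_size] for i in range(0, len(data), task_size)]


def run_job(data, task_size=1000, workers=4):
    """Distribute the chunks to workers and combine their partial results."""
    tasks = split_into_tasks(data, task_size)
    # Each worker thread plays the role of a node (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = list(pool.map(sum, tasks))
    return sum(partial_sums)


total = run_job(list(range(10_000)))
```

Each task is independent of the others, which is what lets the workers run in parallel before the partial results are combined.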
Data requirement for MapReduce
If your job is to coax insight from a very large disk-based information set, measured in terabytes to petabytes, then MapReduce will likely meet your needs.
MapReduce can work with raw data that's stored in disk files, in relational databases, or both.
The data may be structured or unstructured, and is commonly made up of text, binary, or multi-line records.
The most common MapReduce usage pattern employs a distributed file system known as the Hadoop Distributed File System (HDFS).
Data is stored on local disk, and processing is done locally on the computer with the data.
MapReduce Architecture
Map
Key
The key identifies what kind of information we're looking at.
When compared with a relational database, a key usually equates to a column.
Value
The value portion of the key/value pair is an actual instance of data associated with a key.
Examples of Key and Value
Key                  Value         Key/value pair
First name           Danielle      First name/Danielle
Transaction amount   19.96         Transaction amount/19.96
Search term          Snare drums   Search term/Snare drums
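A minimal sketch of how these key/value pairs might be represented in code. The function name `map_record` and the record layout are assumptions made for illustration; the field values come from the examples on this slide.

```python
def map_record(record):
    """Emit one (key, value) pair per field of a record."""
    return [(key, value) for key, value in record.items()]


# Hypothetical record built from the example values on this slide.
record = {
    "First name": "Danielle",
    "Transaction amount": 19.96,
    "Search term": "Snare drums",
}

pairs = map_record(record)
for key, value in pairs:
    print(f"{key}/{value}")
```

Each tuple pairs a key (the kind of information) with a value (one actual instance of it), mirroring the column and cell of a relational table.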
Reduce
After the Map phase is over, all the intermediate values for a given output key are combined together into a list.
The reduce() function then combines the intermediate values into one or more final values for the same key.
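The two sentences above can be sketched as a grouping step followed by a reduce step. This is an in-memory illustration only: `group_by_key` and `reduce_values` are hypothetical names, the input pairs are made up, and summing is just one possible way to combine values.

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all intermediate values for each key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_values(key, values):
    """Combine the list of values for one key into a single final value."""
    return key, sum(values)


# Made-up intermediate pairs, as if emitted by several Map tasks.
intermediate = [("ski", 3), ("skate", 5), ("ski", 2), ("skate", 1)]
grouped = group_by_key(intermediate)
final = dict(reduce_values(k, v) for k, v in grouped.items())
```

Here `grouped` holds one list of values per key, and `final` holds one combined value per key.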
Configuring MapReduce
Components will fail at a high rate
Data will be contained in a relatively small number of big files
Data files are write-once
Lots of streaming reads
Higher sustained throughput across large amounts of data
MapReduce in Action (Example)
In this scenario, you're in charge of the e-commerce website for a very large retailer. You stock over 200,000 individual products, and your website receives hundreds of thousands of visitors each day. In aggregate, your customers place nearly 50,000 orders each day.
Example (Prepare sorted list of search terms)
Step 1: The data should ideally be broken into numerous files of roughly 1 GB each.
Step 2: Each file will be distributed to a different node.
Step 3: On each node, the Map step will produce a list, consisting of each word in the file along with how many times it appears. For example, one node might come up with these intermediate results from its own set of data:
Skate: 4992120
Ski: 303021
Skis: 291101
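The Map step in Step 3 can be sketched as a per-node word count. The transcript stops at Step 3, so the merge-and-sort function below is only an assumed continuation suggested by the slide title; the names `map_step` and `merge_and_sort` and the tiny inputs are hypothetical.

```python
from collections import Counter


def map_step(lines):
    """Count how many times each word appears in one node's file."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_and_sort(per_node_counts):
    """Add up the per-node counts and sort by frequency, highest first."""
    total = Counter()
    for counts in per_node_counts:
        total.update(counts)
    # Ties are broken alphabetically so the output is deterministic.
    return sorted(total.items(), key=lambda item: (-item[1], item[0]))


# Tiny made-up inputs standing in for the roughly 1 GB files on two nodes.
node_a = ["skate shoes", "ski boots", "skate"]
node_b = ["skis", "skate wax", "ski"]

ranked = merge_and_sort([map_step(node_a), map_step(node_b)])
```

Each node counts only its own file, so the Map step needs no coordination between nodes; the counts are combined afterward.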
Benefit from MapReduce if you have
Lots of Data
Multiple servers at your disposal
MapReduce-based software, such as Hadoop
MapReduce Users
Financial services
Telco
Retail
Government
Defense
Homeland security
Health and life services
Utilities
Social networks/Internet
Internet service providers
Other MapReduce Applications
Risk modeling
Recommendation engines
Point of sale transaction analysis
Threat analysis
Search quality
ETL logic for data warehouses
Customer churn analysis
Ad targeting
Network traffic analysis
Trade surveillance
Data sandboxes
Real-World MapReduce Examples
Financial Services
Fraud detection
Asset management
Data source and data store consolidation
Real-World MapReduce Examples
Auto Manufacturing
Vehicle model and option validation
Vehicle mass analysis
Emission reporting
Customer satisfaction