LASER Foundation - Breitman Part 2laser.inf.ethz.ch/2013/material/breitman/Breitman_Part_2.pdf ·...

1

Part 2

Karin Breitman Brazil R&D Center

3 Data Collection

Raw data storage

ETL RDBMS

BI

5

Genesis - Google

6

Hadoop
•  Distributed system for data storage and processing (open source under the Apache license).

7

Hadoop

•  Storage & Compute in 1 Framework
•  Open Source Project of the Apache Software Foundation
•  Written in Java

Two Core Components
•  HDFS: storage in the Hadoop Distributed File System
•  MapReduce: compute via the MapReduce distributed processing platform

9

That said…

RDBMS
•  Schema on write
•  Reads are fast
•  Adjusting required
•  Structured
•  Good for:
   –  OLAP
   –  ACID transactions
   –  Operational data store

Hadoop
•  Schema on read
•  Writes are fast
•  Ingested as is
•  Loosely structured
•  Good for:
   –  Data discovery
   –  Unstructured data
   –  Massive storage
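The contrast between schema on write and schema on read can be illustrated with a small sketch. This is a toy, in-memory example and not Hadoop or RDBMS code; the sample records, the `SCHEMA` tuple, and all function names are hypothetical and chosen only for illustration.

```python
import json

# Hypothetical raw events: the second record carries an extra field
# and the third is missing one.
RAW_LINES = [
    '{"user": "ana", "clicks": 3}',
    '{"user": "bo", "clicks": 5, "referrer": "search"}',
    '{"user": "cy"}',
]

SCHEMA = ("user", "clicks")  # fixed columns, as a relational table would have


def load_schema_on_write(lines):
    """Validate every record against SCHEMA before storing it."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) == set(SCHEMA):
            table.append(tuple(record[col] for col in SCHEMA))
        else:
            rejected.append(line)  # would need adjusting before loading
    return table, rejected


def load_schema_on_read(lines):
    """Store the lines untouched; nothing is parsed at write time."""
    return list(lines)


def query_clicks(stored_lines):
    """Apply a schema only at read time, tolerating missing fields."""
    return sum(json.loads(line).get("clicks", 0) for line in stored_lines)
```

In this sketch the schema-on-write loader accepts one record and rejects two, while the schema-on-read path ingests all three as is and pays the parsing cost when `query_clicks` runs.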

10

And..
•  Hadoop is a paradigm shift in the way we think about and manage data

•  Traditional solutions were not designed with growth in mind

•  Big-Data accelerates this problem dramatically

Category          Traditional RDBMS                                Hadoop
Scalability       Resource constrained                             Linear expansion
                  Re-architecture                                  Seamless addition & subtraction of nodes
                  ~10 TB                                           ~5 PB
Fault Tolerance   Afterthought, many critical points of failure    Designed in, tasks are automatically restarted
Problem Space     Transactional, OLTP                              Batch, OLAP
                  Inability to incorporate new sources             No bounds

11

More importantly

•  Structural changes to an RDBMS (e.g., adding a new column) are really, really hard!

12

HDFS Concepts
•  Performs best with a ‘modest’ number of large files
   –  Millions, rather than billions, of files
   –  Each file typically 100 MB or more
•  Files in HDFS are ‘write once’
   –  No random writes to files are allowed
   –  Append support is available
   –  HDFS is optimized for large, streaming reads of files rather than random reads

13

HDFS
•  Hadoop Distributed File System
   –  Data is organized into files & directories
   –  Files are divided into blocks, typically 64-128 MB each, and distributed across cluster nodes
   –  Block placement is known at runtime by MapReduce, so computation can be co-located with data
   –  Blocks are replicated (default is 3 copies) to handle failure
   –  Checksums are used to ensure data integrity
•  Replication is the one and only strategy for error handling, recovery and fault tolerance
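The block, replication, and checksum figures above translate into simple arithmetic. The sketch below assumes a 128 MB block size and binary units, and uses `zlib.crc32` purely as a stand-in checksum; the slides do not say which checksum algorithm HDFS uses, and the function names are hypothetical.

```python
import zlib

MB = 1024 ** 2
BLOCK_SIZE = 128 * MB  # assumption: upper end of the 64-128 MB range
REPLICATION = 3        # default number of copies per block


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Blocks needed for a file; the last block may be only partly full."""
    return -(-file_size_bytes // block_size)  # ceiling division


def raw_storage_bytes(file_size_bytes, replication=REPLICATION):
    """Total bytes consumed across the cluster once every block is replicated."""
    return file_size_bytes * replication


def checksum(block):
    """Stand-in integrity check (CRC-32) for a block of bytes."""
    return zlib.crc32(block)


def is_intact(block, expected_checksum):
    """Compare a block read from disk against the checksum stored at write time."""
    return checksum(block) == expected_checksum
```

Under these assumptions a 300 MB file occupies 3 blocks and 900 MB of raw cluster storage, and a replica whose bytes changed on disk fails `is_intact`, so a healthy copy would be used instead.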

14

Hadoop Architecture - HDFS
•  Block level storage
•  N-Node replication
•  Namenode for
   –  File system index (EditLog)
   –  Access coordination
•  Datanode for
   –  Data Block Management
   –  Job Execution (MapReduce)

•  Automated Fault Tolerance


15

NameNode
•  Provides a centralized repository for the namespace
   –  An index of which files are stored in which blocks
•  Responds to client requests (MapReduce jobs) by coordinating distribution of tasks

16

Hadoop treats all nodes as Data Nodes, meaning that they can store data, but designates at least one node to be the Name Node.

The Hadoop File System is classified as a “distributed” file system because it manages storage across a network of machines, and the files are distributed across several nodes, in the same or different racks or clusters.

For each Hadoop file, the Name Node decides on which disk each copy of each file block will reside, and it keeps track of that information in tables stored on its local disks.
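A toy model can make the Name Node's bookkeeping concrete. The sketch below is an in-memory illustration only: the class `ToyNameNode`, its methods, and the round-robin placement are assumptions made for this example, and the real placement policy is not described in these slides.

```python
import itertools


class ToyNameNode:
    """Hypothetical in-memory stand-in for the Name Node's placement tables."""

    def __init__(self, data_nodes, replication=3):
        if replication > len(data_nodes):
            raise ValueError("need at least as many nodes as replicas")
        self.replication = replication
        self.block_map = {}  # (file name, block index) -> list of node names
        self._ring = itertools.cycle(list(data_nodes))

    def add_file(self, name, num_blocks):
        """Decide where each copy of each block of the file will reside."""
        for index in range(num_blocks):
            # Consecutive picks from the ring are distinct nodes because
            # replication never exceeds the number of nodes.
            nodes = [next(self._ring) for _ in range(self.replication)]
            self.block_map[(name, index)] = nodes

    def locations(self, name):
        """Return the replica holders of every block of the file, in block order."""
        keys = sorted(key for key in self.block_map if key[0] == name)
        return [self.block_map[key] for key in keys]
```

For example, with four nodes `n1` to `n4` and a two-block file, this sketch places block 0 on `n1, n2, n3` and block 1 on `n4, n1, n2`.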

17

When a node fails, the Name Node identifies all the file blocks that have been affected, retrieves copies of these file blocks from other healthy nodes, finds new nodes to store another copy of them, stores these other copies there, and updates this information in its tables.
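The recovery steps above can be sketched as a single function over a block-to-nodes table. This is a simplified illustration with hypothetical names (`handle_node_failure`, `block_map`); it only updates the table and does not copy any data, and choosing the first available node is an assumption.

```python
def handle_node_failure(block_map, failed, live_nodes, replication=3):
    """Return an updated block map after `failed` has left the cluster.

    block_map maps a block id to the list of nodes holding a copy.
    """
    repaired = {}
    for block, holders in block_map.items():
        # Identify the surviving copies of this block.
        survivors = [node for node in holders if node != failed]
        if not survivors:
            repaired[block] = []  # no healthy copy left to replicate from
            continue
        # Find new nodes that do not already hold a copy.
        candidates = [n for n in live_nodes if n != failed and n not in survivors]
        while len(survivors) < replication and candidates:
            # A real cluster would copy the block from a survivor here.
            survivors.append(candidates.pop(0))
        repaired[block] = survivors  # update the tables
    return repaired
```

Blocks that never lived on the failed node keep their original holders, and affected blocks return to the target replica count whenever enough healthy nodes remain.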

18

When an application needs to read a file, it first connects to the Name Node to get the addresses of the disk blocks where the file blocks are stored. The application can then read these blocks directly, without going through the Name Node again.
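The read path described above can be sketched with two toy classes. The class and method names here (`NameNode.get_block_locations`, `DataNode.read_block`, `read_file`) are hypothetical and exist only in this in-memory example; they are not the Hadoop client API.

```python
class NameNode:
    """Toy metadata service: knows where blocks live, holds no file data."""

    def __init__(self, block_locations):
        self._block_locations = block_locations  # path -> [(block id, node name)]
        self.lookups = 0

    def get_block_locations(self, path):
        self.lookups += 1
        return self._block_locations[path]


class DataNode:
    """Toy storage node: serves block contents by block id."""

    def __init__(self, blocks):
        self._blocks = blocks  # block id -> bytes

    def read_block(self, block_id):
        return self._blocks[block_id]


def read_file(path, name_node, data_nodes):
    """Ask the Name Node once, then fetch every block directly from its node."""
    locations = name_node.get_block_locations(path)
    return b"".join(data_nodes[node].read_block(block_id)
                    for block_id, node in locations)
```

Because the Name Node is consulted once per file in this sketch, its `lookups` counter stays at 1 no matter how many blocks are then read from the data nodes.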

One of the common concerns about the Hadoop Distributed File System is the fact that the Name Node can become a single point of failure.

19

File System Browser

20

MapReduce

21

Map Reduce Framework
•  Map step
   –  Input records are parsed into intermediate key/value pairs
   –  Multiple Maps per Node
      •  10 TB => 128 MB/Blk => 82K Maps
•  Reduce step
   –  Each Reducer handles all like keys
   –  3 Steps
      •  Shuffle: All like keys are retrieved from each Mapper
      •  Sort: Intermediate keys are sorted prior to reduce
      •  Reduce: Values are processed
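The "82K Maps" figure in the Map step follows from dividing the input size by the block size. The short check below assumes one map task per block and binary units (1 TB = 1024^4 bytes); the helper name `map_task_count` is hypothetical.

```python
TB = 1024 ** 4
MB = 1024 ** 2


def map_task_count(input_bytes, block_bytes):
    """One map task per block, rounding up for a partial last block."""
    return -(-input_bytes // block_bytes)  # ceiling division


print(map_task_count(10 * TB, 128 * MB))  # 81920, i.e. roughly 82K
```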

22

Map Reduce

23

MapReduce programming with Java
•  Very low level access to Hadoop APIs
   –  Ultimately not the best/easiest way to interact for a Data Scientist
   –  Components:
   –  Mapper: Class & method (map) called by framework to process (parse) the src data line by line
   –  Reducer: Class & method (reduce) called by framework to process (combine) the output of the Mappers and build the final output
   –  Job: Runtime context for Hadoop
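The division of labour between the three components can be imitated in a few lines. The sketch below is written in Python rather than Java and runs in a single process; `WordCountMapper`, `SumReducer`, and `LocalJob` are hypothetical names for this illustration and are not the Hadoop Java API.

```python
from collections import defaultdict


class WordCountMapper:
    def map(self, line):
        """Called once per input line; emits intermediate (key, value) pairs."""
        for word in line.split():
            yield word.lower(), 1


class SumReducer:
    def reduce(self, key, values):
        """Called once per key with all of its values; emits final pairs."""
        yield key, sum(values)


class LocalJob:
    """Hypothetical stand-in for the job's runtime context."""

    def __init__(self, mapper, reducer):
        self.mapper = mapper
        self.reducer = reducer

    def run(self, lines):
        grouped = defaultdict(list)
        for line in lines:  # the framework drives the mapper line by line
            for key, value in self.mapper.map(line):
                grouped[key].append(value)
        output = []
        for key in sorted(grouped):  # keys reach the reducer in sorted order
            output.extend(self.reducer.reduce(key, grouped[key]))
        return output
```

Running `LocalJob(WordCountMapper(), SumReducer()).run([...])` on a list of text lines returns word counts sorted by word; the author of a job supplies only the `map` and `reduce` logic, and the job object drives everything else.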

24

Reduce Task
•  After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
•  This list is given to a Reducer
   –  There may be a single Reducer, or multiple Reducers
   –  This is specified as part of the job configuration (see later)
   –  All values associated with a particular intermediate key are guaranteed to go to the same Reducer
   –  The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
   –  This step is known as the ‘shuffle and sort’
•  The Reducer outputs zero or more final key/value pairs
   –  These are written to HDFS
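The two guarantees of the shuffle and sort (same key, same Reducer; keys delivered in sorted order) can be sketched as follows. The partitioning rule used here, a CRC-32 of the key modulo the number of Reducers, is an assumption for this example, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Deterministically map a key to one reducer index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(mapper_outputs, num_reducers):
    """Group intermediate pairs by key and route each key to one reducer.

    mapper_outputs is a list (one entry per Mapper) of (key, value) lists.
    Returns, for each reducer, a list of (key, [values]) in sorted key order.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapper_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]
```

Because `partition` depends only on the key, every value for a given key lands in the same bucket regardless of which Mapper produced it.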

25

Bestiary
•  HDFS
•  MapReduce: java, python, stream
•  Packages: Hive, HBase, Pig
•  Analytics: Mahout, R

26

Additional Slides

27

Hive
•  Pseudo database on top of HDFS
•  Stores data on HDFS (/user/hive/warehouse)
•  Each table has a directory with files underneath
•  Files are delimited, Sequence Files, Map parts, Reduce parts
•  Has a command line interface and Thrift server
•  Stores metadata in Derby (default) or MySQL or Postgres
•  Best place for syntax:
   –  http://hive.apache.org and view the manual

•  Ability to create UDFs

28

Pig
•  Provides a mechanism for using MapReduce without programming in Java
   –  Utilizes HDFS & MapReduce
•  Allows for a more intuitive means to specify data flows
   –  High-level sequential, data flow language
   –  Pig Latin, flow expression
   –  Python integration
•  Comfortable for researchers who are familiar with Perl & Python
•  Pig is easier to learn & execute, but more limited in scope of functionality than Java

29

Mahout
•  Important stuff first: most common pronunciation is “Ma-h-out” – rhymes with ‘trout’
•  Machine Learning Library that Runs on HDFS
•  4 Primary Use Cases:
   –  Recommendation Mining – People who like X, also like Y
   –  Clustering – Topic based association
   –  Classification – Assign new docs to existing categories
   –  Frequent Itemset Mining – Which things will appear together
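The "people who like X, also like Y" idea can be illustrated with plain co-occurrence counting. This sketch is ordinary single-machine Python and does not use Mahout or reflect its algorithms; the function names and the sample baskets in the usage note are hypothetical.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(baskets):
    """Count how often each unordered pair of items appears together."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def also_liked(item, baskets, top=3):
    """Items most often seen together with `item`, most frequent first."""
    scores = Counter()
    for (a, b), n in co_occurrence(baskets).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top)]
```

With baskets `[["x", "y"], ["x", "y", "z"], ["x", "z"], ["y"]]`, the pair `("x", "y")` co-occurs twice and `also_liked("y", baskets)` ranks `x` ahead of `z`. The same pair counts are also the starting point for frequent itemset mining.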
