LASER Foundation - Breitman Part 2 (laser.inf.ethz.ch/2013/material/breitman/Breitman_Part_2.pdf)
1
Part 2
Karin Breitman Brazil R&D Center
2
3 Data Collection
Raw data storage
ETL RDBMS
BI
5
Genesis - Google
6
Hadoop
• Distributed system for data storage and processing (open source under the Apache license).
7
Hadoop
• Storage & Compute in 1 Framework
• Open Source Project of the Apache Software Foundation
• Written in Java
Two Core Components
• HDFS: Storage in the Hadoop Distributed File System
• MapReduce: Compute via the MapReduce distributed processing platform
8
9
That said…
RDBMS
• Schema on write
• Reads are fast
• Data must be adjusted to fit the schema
• Structured
• Good for:
  – OLAP
  – ACID transactions
  – Operational data store
Hadoop
• Schema on read
• Writes are fast
• Ingested as is
• Loosely structured
• Good for:
  – Data discovery
  – Unstructured data
  – Massive storage
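The "schema on read" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the raw records, the column names and the `read_with_schema` helper are all hypothetical, and nothing here is Hadoop or Hive code. The point is that the raw text is stored untouched and the schema is applied only when the data is read.

```python
import csv
import io

# Hypothetical raw records, ingested "as is": nothing was validated on write.
RAW = "2013-09-01,alice,42\n2013-09-02,bob,not-a-number\n2013-09-03,carol\n"

# The schema is supplied at read time: (column name, parser) pairs.
SCHEMA = [("day", str), ("user", str), ("clicks", int)]


def read_with_schema(raw_text, schema):
    """Yield one dict per line, applying the schema while reading.

    Fields that are missing or fail to parse become None; the row is kept.
    """
    for row in csv.reader(io.StringIO(raw_text)):
        record = {}
        for index, (name, parse) in enumerate(schema):
            try:
                record[name] = parse(row[index])
            except (IndexError, ValueError):
                record[name] = None
        yield record


records = list(read_with_schema(RAW, SCHEMA))
```

A schema-on-write system would have rejected the second and third lines at load time; here they are stored anyway and the mismatch only surfaces, as `None`, when a reader applies this particular schema.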
10
And..
• Hadoop is a paradigm shift in the way we think about and manage data
• Traditional solutions were not designed with growth in mind
• Big-Data accelerates this problem dramatically

Category          Traditional RDBMS                                Hadoop
Scalability       Resource constrained                             Linear expansion
                  Re-architecture                                  Seamless addition & subtraction of nodes
                  ~ 10TB                                           ~ 5PB
Fault Tolerance   Afterthought, many critical points of failure    Designed in, tasks are automatically restarted
Problem Space     Transactional, OLTP                              Batch, OLAP
                  Inability to incorporate new sources             No bounds
11
More importantly
• Structural changes to an RDBMS (e.g., adding a new column) are really, really hard!
12
HDFS Concepts
• Performs best with a ‘modest’ number of large files
  – Millions, rather than billions, of files
  – Each file typically 100MB or more
• Files in HDFS are ‘write once’
  – No random writes to files are allowed
  – Append support is available
• HDFS is optimized for large, streaming reads of files
  – Rather than random reads
13
HDFS
• Hadoop Distributed File System
  – Data is organized into files & directories
  – Files are divided into blocks, typically 64-128MB each, and distributed across cluster nodes
  – Block placement is known at runtime by map-reduce so computation can be co-located with data
  – Blocks are replicated (default is 3 copies) to handle failure
  – Checksums are used to ensure data integrity
• Replication is the one and only strategy for error handling, recovery and fault tolerance
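The three mechanisms above (blocks, replicas, checksums) can be sketched in plain Python. This is a toy model, not HDFS code: the round-robin placement is an assumption chosen for brevity and is not HDFS's actual placement policy, CRC32 stands in for whatever checksum the cluster uses, and the function and node names are hypothetical.

```python
import zlib

BLOCK_SIZE = 128 * 1024 * 1024  # 128MB, the upper end of the typical range
REPLICATION = 3                 # default number of copies


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Cut a file's bytes into fixed-size blocks; the last one may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Illustrative policy only; it ignores racks, load and free space.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def checksum(block):
    """CRC32 of a block, stored at write time and compared on every read."""
    return zlib.crc32(block)


# Tiny demonstration with a 4-byte block size so the numbers stay readable.
blocks = split_into_blocks(b"0123456789", block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
stored_sums = [checksum(b) for b in blocks]
```

With these inputs the 10-byte file becomes three blocks, each held by three different nodes. A reader that finds `checksum(block)` differing from the stored value would discard that copy and fetch another replica.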
14
Hadoop Architecture - HDFS
• Block level storage
• N-Node replication
• Namenode for
  – File system index (EditLog)
  – Access coordination
• Datanode for
  – Data Block Management
  – Job Execution (MapReduce)
• Automated Fault Tolerance
15
NameNode
• Provides a centralized repository for the namespace
  – An index of which files are stored in which blocks
• Responds to client requests (map-reduce jobs) by coordinating distribution of tasks
16
Hadoop treats all nodes as Data Nodes, meaning that they can store data, but designates at least one node to be the Name Node.
The Hadoop File System is classified as a “distributed” file system because it manages storage across a network of machines, and files are distributed across several nodes in the same or different racks or clusters.
For each Hadoop file, the Name Node decides on which disk each copy of each file block will reside, and it keeps track of that information in tables stored on its local disks.
17
When a node fails, the Name Node identifies all the file blocks that have been affected, retrieves copies of these file blocks from other healthy nodes, finds new nodes to store another copy of them, stores these copies there, and updates this information in its tables.
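The recovery sequence can be sketched as a single function over the Name Node's block table. This is a simplified model under stated assumptions: the table is a plain dictionary, target nodes are picked in alphabetical order, and `handle_node_failure` is a hypothetical name, not part of any Hadoop API.

```python
def handle_node_failure(block_map, failed, live_nodes, replication=3):
    """Restore the replication level after `failed` goes down.

    block_map maps a block id to the set of nodes holding a copy; it is
    updated in place. Returns the (block, source, target) copies scheduled.
    """
    copies = []
    for block_id, holders in block_map.items():
        if failed not in holders:
            continue                      # block not affected
        holders.discard(failed)
        if not holders:
            continue                      # every replica is gone, nothing to copy
        source = min(holders)             # any healthy holder can serve as source
        candidates = sorted(
            n for n in live_nodes if n != failed and n not in holders
        )
        missing = max(0, replication - len(holders))
        for target in candidates[:missing]:
            holders.add(target)           # update the Name Node's table
            copies.append((block_id, source, target))
    return copies


# Hypothetical cluster state: two blocks, node n1 fails.
table = {"blk_1": {"n1", "n2", "n3"}, "blk_2": {"n2", "n3", "n4"}}
scheduled = handle_node_failure(table, "n1", ["n2", "n3", "n4", "n5"])
```

Only `blk_1` had a copy on the failed node, so one copy is scheduled and `blk_2` is left alone.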
18
When an application needs to read a file, it first connects to the Name Node to get the addresses of the disk blocks where the file blocks are stored. The application can then read these blocks directly, without going through the Name Node again.
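A minimal sketch of this read path, assuming toy in-memory `NameNode` and `DataNode` classes (both hypothetical, as is the file path used in the demonstration): the client makes one metadata lookup and then pulls the bytes straight from the data nodes.

```python
class NameNode:
    """Holds metadata only: which blocks form a file and where they live."""

    def __init__(self, file_blocks, block_locations):
        self.file_blocks = file_blocks          # path -> ordered block ids
        self.block_locations = block_locations  # block id -> node names
        self.lookups = 0

    def get_block_locations(self, path):
        self.lookups += 1
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


class DataNode:
    """Holds the actual block contents."""

    def __init__(self, blocks):
        self.blocks = blocks                    # block id -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(path, name_node, data_nodes):
    """Ask the Name Node once, then read every block from a data node."""
    parts = []
    for block_id, holders in name_node.get_block_locations(path):
        parts.append(data_nodes[holders[0]].read_block(block_id))
    return b"".join(parts)


data_nodes = {
    "n1": DataNode({"blk_1": b"hello "}),
    "n2": DataNode({"blk_2": b"world"}),
}
name_node = NameNode(
    {"/demo.txt": ["blk_1", "blk_2"]},
    {"blk_1": ["n1"], "blk_2": ["n2"]},
)
content = read_file("/demo.txt", name_node, data_nodes)
```

Because no file data passes through `NameNode`, it handles one small request per file read rather than one per block.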
One of the common concerns about the Hadoop Distributed File System is the fact that the Name Node can become a single point of failure.
19
File System Browser
20
MapReduce
21
Map Reduce Framework
• Map step
  – Input records are parsed into intermediate key/value pairs
  – Multiple Maps per Node
    • 10TB => 128MB/Blk => 82K Maps
• Reduce step
  – Each Reducer handles all like keys
  – 3 Steps
    • Shuffle: All like keys are retrieved from each Mapper
    • Sort: Intermediate keys are sorted prior to reduce
    • Reduce: Values are processed
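The steps above can be sketched in plain Python with a word count. None of this is the Hadoop API; all function names are hypothetical, and everything runs in one process. The first function reproduces the map-count arithmetic, assuming one map task per block and binary units (1TB = 2^40 bytes, 1MB = 2^20 bytes).

```python
from collections import defaultdict


def num_map_tasks(input_bytes, block_bytes):
    """One map task per block: ceiling of input size over block size."""
    return -(-input_bytes // block_bytes)


def map_step(record):
    """Parse one input record into intermediate (key, value) pairs."""
    for word in record.split():
        yield word.lower(), 1


def shuffle(mapper_outputs):
    """Collect all like keys from every mapper's output."""
    grouped = defaultdict(list)
    for output in mapper_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Process the values for one key."""
    return key, sum(values)


def run(splits):
    """Run map on each split, then shuffle, sort the keys, and reduce."""
    mapper_outputs = [
        [pair for record in split for pair in map_step(record)]
        for split in splits
    ]
    grouped = shuffle(mapper_outputs)
    return [reduce_step(key, grouped[key]) for key in sorted(grouped)]


maps_needed = num_map_tasks(10 * 2**40, 128 * 2**20)
counts = run([["the quick fox", "the dog"], ["The end"]])
```

`maps_needed` comes out at 81,920, which is the "82K Maps" figure on the slide. `counts` holds one `(word, total)` pair per distinct word, in sorted key order.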
22
Map Reduce
23
MapReduce programming with Java
• Very low level access to Hadoop APIs
  – Ultimately not the best/easiest way to interact for a Data Scientist
• Components:
  – Mapper: Class & method (map) called by the framework to process (parse) the source data line by line
  – Reducer: Class & method (reduce) called by the framework to process (combine) the output of the Mappers and build the final output
  – Job: Runtime context for Hadoop
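The division of labour between the three components can be sketched in Python rather than Java. This is an analogue, not the Hadoop Java API: it borrows the streaming convention of tab-separated key/value text lines, and the names `mapper`, `reducer` and `run_job`, along with the "year temperature" input format, are assumptions made for the example.

```python
from itertools import groupby


def mapper(lines):
    """Parse the source data line by line and emit 'key<TAB>value' lines."""
    for line in lines:
        year, temperature = line.split()
        yield f"{year}\t{temperature}"


def reducer(sorted_lines):
    """Combine mapper output into final output: the maximum value per key.

    Expects its input sorted by key, as the framework would deliver it.
    """
    pairs = (line.split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{key}\t{max(int(value) for _, value in group)}"


def run_job(lines):
    """Stand-in for the job: wire mapper to reducer with a sort in between."""
    return list(reducer(sorted(mapper(lines))))


result = run_job(["1950 22", "1949 11", "1950 31"])
```

In a real job the framework, not `run_job`, performs the sort and moves data between machines; the programmer supplies only the two functions and the job configuration.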
24
Reduce Task
• After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
• This list is given to a Reducer
  – There may be a single Reducer, or multiple Reducers
  – This is specified as part of the job configuration (see later)
  – All values associated with a particular intermediate key are guaranteed to go to the same Reducer
  – The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
  – This step is known as the ‘shuffle and sort’
• The Reducer outputs zero or more final key/value pairs
  – These are written to HDFS
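One way to obtain the "same key, same Reducer" guarantee is to pick the Reducer from a hash of the key. The sketch below assumes that approach, using CRC32 as a deterministic hash; the function names are hypothetical and the code is not taken from Hadoop.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Map a key to a reducer index; equal keys always get the same index."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(pairs, num_reducers):
    """Group intermediate pairs per reducer, then sort each reducer's keys.

    Returns one list per reducer of (key, [values]) in sorted key order.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    return [sorted(bucket.items()) for bucket in buckets]


intermediate = [("b", 1), ("a", 2), ("b", 3), ("c", 4), ("a", 5)]
reducer_inputs = shuffle_and_sort(intermediate, num_reducers=2)
```

Each entry of `reducer_inputs` is what one Reducer would receive: every key appears in exactly one entry, together with the complete list of its values.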
25
Bestiary
• HDFS
• MapReduce: java, python, stream
• Packages: Hive, HBase, Pig
• Analytics: Mahout, R
26
Additional Slides
27
Hive
• Pseudo database on top of HDFS
• Stores data on HDFS (/user/hive/warehouse)
• Each table has a directory with files underneath
• Files are delimited, Sequence Files, Map parts, Reduce parts
• Has a command line interface and Thrift server
• Stores metadata in Derby (default) or MySQL or Postgres
• Best place for syntax:
  – http://hive.apache.org and view the manual
• Ability to create UDFs
28
Pig
• Provides a mechanism for using MapReduce without programming in Java
  – Utilizes HDFS & MapReduce
• Allows for a more intuitive means to specify data flows
  – High-level sequential, data flow language
  – Pig Latin, flow expression
  – Python integration
• Comfortable for researchers who are familiar with Perl & Python
• Pig is easier to learn & execute, but more limited in scope of functionality than Java
29
Mahout
• Important stuff first: most common pronunciation is “Ma-h-out” – rhymes with ‘trout’
• Machine Learning Library that Runs on HDFS
• 4 Primary Use Cases:
  – Recommendation Mining: People who like X, also like Y
  – Clustering: Topic based association
  – Classification: Assign new docs to existing categories
  – Frequent Itemset Mining: Which things will appear together
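The "people who like X, also like Y" idea can be illustrated by counting how often items appear together. The sketch below is plain single-machine Python, not Mahout code; the function names and the sample data are hypothetical, and real recommenders use more refined similarity measures than a raw count.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[(a, b)] += 1
    return counts


def also_liked(item, baskets):
    """Rank other items by how often they co-occur with `item`."""
    scores = Counter()
    for (a, b), n in cooccurrence(baskets).items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common()]


likes = [["X", "Y", "Z"], ["X", "Y"], ["Y", "Z"], ["X", "W"]]
pair_counts = cooccurrence(likes)
suggestions = also_liked("X", likes)
```

The same pair counts are also the raw material for frequent itemset mining, which keeps the pairs (and larger sets) whose count exceeds a chosen threshold.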