Cloud Computing Other High-level parallel processing languages Keke Chen.


Page 1:

Cloud Computing

Other High-level parallel processing languages

Keke Chen

Page 2:

Outline
- Sawzall
- Dryad and DryadLINQ (Microsoft, abandoned)
- Hive

Page 3:

Sawzall
- Simplifies MapReduce programming
- Filters + aggregators
  - Filters play the mapper role; aggregators play the reducer role

Page 4:

Example

[Figure: example Sawzall program. Labels: "mappers", "reducers"; annotation: "Convert the input record to float".]
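
The example program itself did not survive in this transcript. As a rough illustration of the filter-plus-aggregator split, the Python sketch below assumes a program that converts each record to a float and emits into three sum tables. The table names (`count`, `total`, `sum_of_squares`) and the helpers `run_filter` and `aggregate` are hypothetical and are not Sawzall APIs.

```python
from collections import defaultdict


def run_filter(record):
    """Mapper side: look at ONE record, convert it to float, emit (table, value) pairs."""
    x = float(record)  # corresponds to "convert the input record to float"
    yield ("count", 1)
    yield ("total", x)
    yield ("sum_of_squares", x * x)


def aggregate(emitted):
    """Reducer side: every table in this sketch behaves like a 'sum' aggregator."""
    tables = defaultdict(float)
    for name, value in emitted:
        tables[name] += value
    return dict(tables)


# Assumed toy input: three records, each holding one number as text.
records = ["1.0", "2.0", "3.0"]
emitted = [pair for r in records for pair in run_filter(r)]
result = aggregate(emitted)
```

The filter never sees more than one record at a time, so it can run on any number of mappers; only the aggregation step combines values across records.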

Page 5:

Input
- A Sawzall program works on a single record
  - It acts as a filter over the data stream
- Input can be parsed into
  - Values, e.g., float
  - Data structures
- Form: variable : type = input
    x: float = input;

Page 6:

Aggregators
- Definition:
    table agg_name of data_type/variable
- Examples:
    c: table collection of string;
    s: table sample(100) of string;
    s: table sum of {count: int, revenue: float};
- More aggregators: maximum, quantile, top, unique

Page 7:

Indexed aggregators
- Similar to “group by”; the index is the group id
- Example:
    t1: table sum[country: string] of int;
    country: string = input;
    emit t1[country] <- 1;
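
To make the "group by" analogy concrete, here is a minimal Python sketch of what the indexed sum table computes, assuming each input record is just a country string. The function name `emit_country_counts` is made up for illustration; the comments map each line back to the Sawzall statement it imitates.

```python
from collections import Counter


def emit_country_counts(records):
    """Imitate `t1: table sum[country: string] of int` with a Counter keyed by country."""
    t1 = Counter()
    for record in records:
        country = record  # country: string = input;
        t1[country] += 1  # emit t1[country] <- 1;
    return t1


# Assumed toy input: one country code per record.
t1 = emit_country_counts(["US", "CN", "US", "DE"])
```

Each distinct index value gets its own running sum, which is what `SELECT country, COUNT(*) ... GROUP BY country` would produce in SQL.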

Page 8:

More example

    proto "querylog.proto"
    queries_per_degree: table sum[lat: int][lon: int] of int;
    log_record: QueryLogProto = input;
    loc: Location = locationinfo(log_record.ip);
    emit queries_per_degree[int(loc.lat)][int(loc.lon)] <- 1;

Page 9:

Performance
- Single-CPU speed: 51 times slower than compiled C++

Page 10:

Performance

Page 11:

Dryad and DryadLINQ
- Dryad provides a low-level parallel data flow processing interface
  - Acyclic data flow graphs
  - Data communication methods include pipes, files, messages, and shared memory
- DryadLINQ
  - A high-level language for application developers
  - It hides the data flow details

Page 12:

Job = Directed Acyclic Graph
- Processing vertices
- Channels (file, pipe, shared memory)
- Inputs
- Outputs
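
The job-as-DAG idea can be sketched in a few lines of Python. This is not Dryad's API: `run_job`, the vertex names, and the in-memory `channels` dictionary are all assumptions for illustration, and the standard-library `graphlib` module (Python 3.9+) stands in for the scheduler.

```python
from graphlib import TopologicalSorter


def run_job(upstream, vertex_fns):
    """Run every vertex once all of its upstream vertices have produced output."""
    channels = {}  # vertex name -> output; stands in for file/pipe/shared-memory channels
    for vertex in TopologicalSorter(upstream).static_order():
        # Sort the upstream names so the argument order is deterministic.
        inputs = [channels[u] for u in sorted(upstream[vertex])]
        channels[vertex] = vertex_fns[vertex](*inputs)
    return channels


# Hypothetical job: two input vertices feed a merge vertex, which feeds a count vertex.
upstream = {
    "read_a": set(),
    "read_b": set(),
    "merge": {"read_a", "read_b"},
    "count": {"merge"},
}
vertex_fns = {
    "read_a": lambda: [1, 2],
    "read_b": lambda: [3],
    "merge": lambda a, b: a + b,
    "count": lambda xs: len(xs),
}
outputs = run_job(upstream, vertex_fns)
```

Because the graph is acyclic, a topological order always exists; `TopologicalSorter` raises `CycleError` if a cycle is introduced by mistake. A real runtime would run independent vertices such as `read_a` and `read_b` in parallel rather than one after another.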

Page 13:

Runtime
- Services
  - Name server
  - Daemon
- Job Manager (JM)
  - Centralized coordinating process
  - User application to construct the graph
  - Linked with Dryad libraries for scheduling vertices
- Vertex executable
  - Dryad libraries to communicate with the JM
  - User application sees channels in/out
  - Arbitrary application code; can use the local FS

Page 14:

Graph operators

Page 15:

Hive
- Developed by Facebook (open source)
- Mimics the SQL language
- Built on Hadoop/MapReduce

Page 16:

Hive data model: table etc.
- Table
  - Similar to a DB table; stored in Hadoop directories
  - Built-in compression, serialization/deserialization
- Partitions
  - Groups in the table
  - Each is a subdirectory of the table directory
- Buckets
  - Files in the partition directory
  - Key (column) based partitioning
- Layout: /table/partition/bucket1
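
As a sketch of how a row could be routed to a location of the form /table/partition/bucket, the Python below hashes the bucketing column and takes it modulo the number of buckets. The helper `bucket_path`, the CRC32 hash, and the `bucketN` file naming are assumptions that follow the schematic layout on the slide; they are not Hive's actual hash function or file names.

```python
import zlib


def bucket_path(table, partition_col, partition_val, bucket_key, num_buckets):
    """Return a schematic /table/partition/bucket location for one row."""
    # Assumed hash: CRC32 of the key's text form, chosen because it is stable across runs.
    bucket = zlib.crc32(str(bucket_key).encode("utf-8")) % num_buckets
    return f"/{table}/{partition_col}={partition_val}/bucket{bucket + 1}"


# Hypothetical row: userid 42, loaded into the ds='2009-03-20' partition, 4 buckets.
path = bucket_path("status_updates", "ds", "2009-03-20", 42, 4)
```

The partition value picks the subdirectory and the hashed key picks the file, so rows with the same key always land in the same bucket within a partition.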

Page 17:

Hive data model: column types
- Integers, floating point numbers, generic strings, dates, and booleans
- Nestable collection types: array and map

Page 18:

Page 19:

Architecture
- The metastore stores the schemas of databases. It uses a non-HDFS data store.

Page 20:

Query processing steps (similar to a DBMS)
- Parser
- Semantic analyzer
- Logical plan generator (algebra tree)
- Optimizer
- Physical plan generator (to MapReduce jobs)

Page 21:

Operations: DDL and DML
- HiveQL: SQL-like, with slightly different syntax
- User-defined filtering and aggregation functions
  - Java only
- Map/reduce plugin for stream processing
  - Can be implemented in any language

Page 22:

Example: Facebook status updates
- Tables:
    status_updates(userid int, status string, ds string)
    profiles(userid int, school string, gender int)
- Operations
  - Load data:
      LOAD DATA LOCAL INPATH '/logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20')
  - Count status updates by school and by gender
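
The query text for the counting step is not in this transcript. As a hedged illustration of what "count status updates by school and by gender" computes, the Python sketch below joins the two tables on `userid` and keeps two group counts. The sample rows and the function `count_by_school_and_gender` are invented for the example; in Hive this would be a join followed by GROUP BY.

```python
from collections import Counter

# Invented sample rows matching the slide's schemas.
status_updates = [  # (userid, status, ds)
    (1, "hello", "2009-03-20"),
    (2, "lunch", "2009-03-20"),
    (1, "bye", "2009-03-20"),
    (3, "old", "2009-03-19"),
]
profiles = [  # (userid, school, gender)
    (1, "MIT", 0),
    (2, "WSU", 1),
    (3, "MIT", 1),
]


def count_by_school_and_gender(status_updates, profiles, ds):
    """Join on userid for one ds partition, then count updates per school and per gender."""
    profile_by_user = {uid: (school, gender) for uid, school, gender in profiles}
    by_school, by_gender = Counter(), Counter()
    for uid, _status, day in status_updates:
        if day != ds or uid not in profile_by_user:
            continue  # wrong partition, or no matching profile (inner-join semantics)
        school, gender = profile_by_user[uid]
        by_school[school] += 1
        by_gender[gender] += 1
    return by_school, by_gender


by_school, by_gender = count_by_school_and_gender(status_updates, profiles, "2009-03-20")
```

Filtering on `ds` first mirrors the benefit of partitioning: only the directory for that date needs to be read.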

Page 23:

More query examples

Page 24:

Query examples

Page 25:

Query examples: using Hadoop streaming
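
The streaming queries themselves are not in this transcript. A streaming map script is simply a program that reads tab-separated rows and writes tab-separated rows, which is why it can be written in any language. The Python sketch below assumes input rows of the form `userid<TAB>status` and emits one `word<TAB>1` row per word; the function name `map_lines` and the row layout are assumptions.

```python
def map_lines(lines):
    """Turn each tab-separated (userid, status) row into (word, 1) rows."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed rows instead of failing the whole task
        for word in fields[1].split():
            yield f"{word.lower()}\t1"


# Assumed toy input: one well-formed row and one malformed row.
rows = list(map_lines(["1\tHello world\n", "bad\n"]))
```

In an actual streaming script the same function would be fed `sys.stdin` and each yielded row would be printed to standard output.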