FI-WARE Cosmos


Transcript of FI-WARE Cosmos

Page 1: FI-WARE Cosmos

Open APIs for Open Minds

Building your first application using FI-WARE Cosmos, Big Data GE implementation

Page 2: FI-WARE Cosmos

Big Data and Open Data: What is it and how much data is there

Page 3: FI-WARE Cosmos

Big Data and Open Data

[Figure: interior of the Stockholm Public Library, illustrating "open data" vs. "big data".]

http://commons.wikimedia.org/wiki/File:Interior_view_of_Stockholm_Public_Library.jpg

Page 4: FI-WARE Cosmos

How much data is there?


Page 5: FI-WARE Cosmos

Data growing forecast

Cisco's forecast, 2012 vs. 2017:

Metric                              | 2012 | 2017
Global users (billions)             | 2.3  | 3.6
Global networked devices (billions) | 12   | 19
Global broadband speed (Mbps)       | 11.3 | 39
Global traffic (zettabytes)         | 0.5  | 1.4

1 zettabyte = 10^21 bytes = 1,000,000,000,000,000,000,000 bytes

http://www.cisco.com/en/US/netsol/ns827/networking_solutions_sub_solution.html#~forecast

Page 6: FI-WARE Cosmos

How to deal with it: the Hadoop reference

Page 7: FI-WARE Cosmos

Hadoop was created by Doug Cutting at Yahoo!… based on the MapReduce patent by Google

Page 8: FI-WARE Cosmos

Well, MapReduce was really invented by Julius Caesar

Divide et impera*

* Divide and conquer

Page 9: FI-WARE Cosmos

An example

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: several mappers scan the library catalog in parallel; each record carries a language, a book reference and a page count (LATIN REF1 P45, GREEK REF2 P128, EGYPT REF3 P12, LATIN REF4 P73, LATIN REF5 P34, EGYPT REF6 P10, GREEK REF7 P20, GREEK REF8 P230). The first mapper finds a Latin book and emits "LATIN, pages 45"; the reducer starts accumulating: 45 (ref 1). The other mappers are still reading.]

Page 10: FI-WARE Cosmos

An example

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: the first mapper now reads GREEK REF2 P128 and discards it (not Latin, so nothing is emitted). The reducer still holds 45 (ref 1); the other mappers are still reading.]

Page 11: FI-WARE Cosmos

An example

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: the mappers reach LATIN REF4 P73 and LATIN REF5 P34 and emit "LATIN, pages 73" and "LATIN, pages 34". The reducer accumulates: 45 (ref 1) + 73 (ref 4) + 34 (ref 5).]

Page 12: FI-WARE Cosmos

An example

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: the remaining Greek books (GREEK REF7 P20, GREEK REF8 P230) are read and discarded; one mapper is already idle. The reducer keeps 45 (ref 1) + 73 (ref 4) + 34 (ref 5).]

Page 13: FI-WARE Cosmos

An example

How many pages are written in Latin among the books in the Ancient Library of Alexandria?

[Figure: all the mappers are idle; the reducer emits the final result: 45 (ref 1) + 73 (ref 4) + 34 (ref 5) = 152 TOTAL.]
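In Hadoop terms, the counting done by hand above maps directly onto code. The following is a minimal, hypothetical sketch written against the old org.apache.hadoop.mapred API (the same style as the WordCount example later in this deck); the one-record-per-line input format "LANGUAGE REFn PAGES" and all class names are assumptions made for illustration:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LatinPages {

    public static class MapClass extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
            /* e.g. "LATIN REF1 45" -> ["LATIN", "REF1", "45"] (assumed format) */
            String[] fields = value.toString().split(" ");

            /* books not written in Latin are simply discarded */
            if (fields[0].equals("LATIN")) {
                /* emit an output (language, pages) pair for the reducer to add up */
                output.collect(new Text(fields[0]),
                    new IntWritable(Integer.parseInt(fields[2])));
            } // if
        } // map
    } // MapClass

    public static class ReduceClass extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
            int sum = 0;

            /* in the library example: 45 + 73 + 34 = 152 */
            while (values.hasNext()) {
                sum += values.next().get();
            } // while

            output.collect(key, new IntWritable(sum));
        } // reduce
    } // ReduceClass
} // LatinPages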

Page 14: FI-WARE Cosmos

Hadoop architecture

[Figure: Hadoop architecture diagram; the head node coordinates the cluster nodes.]

Page 15: FI-WARE Cosmos

FI-WARE proposal: Cosmos Big Data

Page 16: FI-WARE Cosmos

What is Cosmos?

• Cosmos is Telefónica's Big Data and Open Data asset.

• Cosmos is Hadoop ecosystem-based:
  • HDFS as its distributed file system
  • Hadoop core as its MapReduce engine
  • HiveQL and Pig for querying the data
  • Oozie as remote MapReduce jobs and Hive launcher

• Plus other proprietary features:
  • Dynamic creation of private computing clusters as a service
  • Infinity, a cluster for persistent storage
  • Infinity protocol (secure WebHDFS)
  • Cygnus, an injector for context data coming from Orion CB

• Plus open datasets

Page 17: FI-WARE Cosmos

Cosmos architecture


Page 18: FI-WARE Cosmos

Cluster services: from WebHDFS to Cygnus

Page 19: FI-WARE Cosmos

Storage services within the Infinity cluster


DEPRECATED

Page 20: FI-WARE Cosmos

Computing services within a private cluster


Page 22: FI-WARE Cosmos

Cosmos open datasets: powered by Smart Cities

Page 23: FI-WARE Cosmos

Open Datasets in Cosmos

Smart Cities:

Source    | Dataset                | Data type  | Notes
Málaga    | Plagues tracking       | Historical |
Santander | Smart Santander        | Sensoring  | Data coming through Orion Context Broker
Santander | Parque de las Llamas   | Sensoring  | Data coming through Orion Context Broker
Sevilla   | Bikes renting          | Historical |
Sevilla   | Water metering         | Historical |
Sevilla   | Census                 | Historical |
Sevilla   | Infrastructures        | Historical |
Zaragoza  | Air quality            | Historical |

Other:

Source    | Dataset                | Data type  | Notes
Twitter   | FI-WARE-related tweets | Streaming  |
AEMET     | Weather                | Historical | Until September 2013

http://forge.fi-ware.eu/plugins/mediawiki/wiki/fiware/index.php/FI-WARE_open_datasets_central

Page 24: FI-WARE Cosmos

How to create clusters: getting your Roman legion

Page 25: FI-WARE Cosmos

Using the RESTful API (1)


Page 26: FI-WARE Cosmos

Using the RESTful API (2)


Page 27: FI-WARE Cosmos

Using the RESTful API (3)


Page 28: FI-WARE Cosmos

Using the CLI

• Creating a cluster
  $ cosmos create --name <STRING> --size <INT>

• Listing all the clusters
  $ cosmos list

• Showing a cluster's details
  $ cosmos show <CLUSTER_ID>

• Connecting to the Head Node of a cluster
  $ cosmos ssh <CLUSTER_ID>

• Terminating a cluster
  $ cosmos terminate <CLUSTER_ID>

• Listing available services
  $ cosmos list-services

• Creating a cluster with specific services
  $ cosmos create --name <STRING> --size <INT> --services <SERVICES_LIST>
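Putting the commands together, a typical working session might look like this (a hypothetical sketch; the cluster name and size are invented, and <CLUSTER_ID> stands for the identifier returned at creation time):

$ cosmos create --name mycluster --size 2
$ cosmos list
$ cosmos show <CLUSTER_ID>
$ cosmos ssh <CLUSTER_ID>
$ cosmos terminate <CLUSTER_ID>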

Page 29: FI-WARE Cosmos

How to exploit the data: an incremental approach

Page 30: FI-WARE Cosmos

Let’s go step by step…

1. Familiarize yourself with Hadoop file system commands

2. Learn how to use the WebHDFS/HttpFS REST API

3. Play with the local Hive CLI

4. Write your own remote Hive client

5. Write your first MapReduce applications

6. Use Oozie to remotely launch MR and Hive tasks

Page 31: FI-WARE Cosmos

1. Hadoop filesystem commands

• Hadoop general command
  $ hadoop

• Hadoop file system subcommand
  $ hadoop fs

• Hadoop file system options
  $ hadoop fs -ls
  $ hadoop fs -mkdir <hdfs-dir>
  $ hadoop fs -rmr <hdfs-file>
  $ hadoop fs -cat <hdfs-file>
  $ hadoop fs -put <local-file> <hdfs-dir>
  $ hadoop fs -get <hdfs-file> <local-dir>

• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/CommandsManual.html
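As an example, a round trip that uploads a local file and reads it back might look like this (the file and directory names are hypothetical):

$ hadoop fs -mkdir input
$ hadoop fs -put mylocalfile.txt input
$ hadoop fs -ls input
$ hadoop fs -cat input/mylocalfile.txt
$ hadoop fs -get input/mylocalfile.txt /tmp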

Page 32: FI-WARE Cosmos

2. WebHDFS/HttpFS REST API

• List a directory
  GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS

• Create a new directory
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=MKDIRS[&permission=<OCTAL>]

• Delete a file or directory
  DELETE http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=DELETE[&recursive=<true|false>]

• Rename a file or directory
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=RENAME&destination=<PATH>

• Concat files
  POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CONCAT&sources=<PATHS>

• Set permission
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETPERMISSION[&permission=<OCTAL>]

• Set owner
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=SETOWNER[&owner=<USER>][&group=<GROUP>]
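As a sketch, one of these operations can be issued from plain Java with nothing but the standard library; the host, port, path and user below are hypothetical, and the raw JSON response is simply printed out:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHDFSList {
    public static void main(String[] args) throws Exception {
        // hypothetical WebHDFS/HttpFS endpoint; user.name is the pseudo-authentication parameter
        URL url = new URL("http://cosmos.example.org:14000/webhdfs/v1/user/myuser?op=LISTSTATUS&user.name=myuser");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");

        // the response is a JSON document describing the directory entries
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        } // while
        in.close();
    } // main
} // WebHDFSList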

Page 33: FI-WARE Cosmos

2. WebHDFS/HttpFS REST API (cont.)

• Create a new file with initial content (2-step operation)
  PUT http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE[&overwrite=<true|false>][&blocksize=<LONG>][&replication=<SHORT>][&permission=<OCTAL>][&buffersize=<INT>]

  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...
  Content-Length: 0

  PUT -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=CREATE...

• Append to a file (2-step operation)
  POST http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=APPEND[&buffersize=<INT>]

  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...
  Content-Length: 0

  POST -T <LOCAL_FILE> http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=APPEND...

Page 34: FI-WARE Cosmos

2. WebHDFS/HttpFS REST API (cont.)

• Open and read a file (2-step operation)
  GET http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=OPEN[&offset=<LONG>][&length=<LONG>][&buffersize=<INT>]

  HTTP/1.1 307 TEMPORARY_REDIRECT
  Location: http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...
  Content-Length: 0

  GET http://<DATANODE>:<PORT>/webhdfs/v1/<PATH>?op=OPEN...

• http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

• HttpFS does not redirect to the Datanode but to the HttpFS server, hiding the Datanodes (and saving tens of public IP addresses)
  • The API is the same
  • http://hadoop.apache.org/docs/current/hadoop-hdfs-httpfs/index.html
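The 2-step OPEN operation is easy to consume from Java because HttpURLConnection follows the 307 redirect on a GET automatically; a minimal hypothetical sketch (host, port, path and user are invented):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHDFSRead {
    public static void main(String[] args) throws Exception {
        // step 1: ask the Namenode (or the HttpFS server) for the file
        URL url = new URL("http://cosmos.example.org:14000/webhdfs/v1/user/myuser/data.txt?op=OPEN&user.name=myuser");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true); // the default; step 2 (the redirect) is followed transparently

        // read the file content line by line
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        } // while
        in.close();
    } // main
} // WebHDFSRead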

Page 35: FI-WARE Cosmos

3. Local Hive CLI

• Hive is a querying tool

• Queries are expressed in HiveQL, a SQL-like language
  • https://cwiki.apache.org/confluence/display/Hive/LanguageManual

• Hive uses pre-defined MapReduce jobs for
  • Column selection
  • Fields grouping
  • Table joining
  • …

• All the data is loaded into Hive tables

Page 36: FI-WARE Cosmos

3. Local Hive CLI (cont.)

• Log on to the Master node
• Run the hive command
• Type your SQL-like sentence!

$ hive
Hive history file=/tmp/myuser/hive_job_log_opendata_XXX_XXX.txt
hive> select column1,column2,otherColumns from mytable where column1='whatever' and columns2 like '%whatever%';
Total MapReduce jobs = 1
Launching Job 1 out of 1
Starting Job = job_201308280930_0953, Tracking URL = http://cosmosmaster-gi:50030/jobdetails.jsp?jobid=job_201308280930_0953
Kill Command = /usr/lib/hadoop/bin/hadoop job -Dmapred.job.tracker=cosmosmaster-gi:8021 -kill job_201308280930_0953
2013-10-03 09:15:34,519 Stage-1 map = 0%, reduce = 0%
2013-10-03 09:15:36,545 Stage-1 map = 67%, reduce = 0%
2013-10-03 09:15:37,554 Stage-1 map = 100%, reduce = 0%
2013-10-03 09:15:44,609 Stage-1 map = 100%, reduce = 33%
…

Page 37: FI-WARE Cosmos

4. Remote Hive client

• The Hive CLI is OK for human-driven testing purposes
  • But it is not usable by remote applications

• Hive has no REST API
• Hive has several drivers and libraries
  • JDBC for Java
  • Python
  • PHP
  • ODBC for C/C++
  • Thrift for Java and C++
  • https://cwiki.apache.org/confluence/display/Hive/HiveClient

• A remote Hive client usually performs:
  • A connection to the Hive server
  • The query execution

Page 38: FI-WARE Cosmos

4. Remote Hive client – Get a connection

private Connection getConnection(String ip, String port, String user, String password) {
    try {
        // dynamically load the Hive JDBC driver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    } catch (ClassNotFoundException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch

    try {
        // return a connection based on the Hive JDBC driver, default DB
        return DriverManager.getConnection("jdbc:hive://" + ip + ":" + port
            + "/default?user=" + user + "&password=" + password);
    } catch (SQLException e) {
        System.out.println(e.getMessage());
        return null;
    } // try catch
} // getConnection

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client

Page 39: FI-WARE Cosmos

4. Remote Hive client – Do the query

private void doQuery() {
    try {
        // from here on, everything is SQL!
        Statement stmt = con.createStatement();
        ResultSet res = stmt.executeQuery("select column1,column2,"
            + "otherColumns from mytable where column1='whatever' and "
            + "columns2 like '%whatever%'");

        // iterate on the result
        while (res.next()) {
            String column1 = res.getString(1);
            int column2 = res.getInt(2);
            // whatever you want to do with this row, here
        } // while

        // close everything
        res.close();
        stmt.close();
        con.close();
    } catch (SQLException ex) {
        System.exit(0);
    } // try catch
} // doQuery

https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client
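A hypothetical main method gluing the two snippets together could look like this; it assumes getConnection() and doQuery() live in the same class and that con is a field of that class, as the snippets suggest (the server IP, port 10000 — the usual Hive server port — and the credentials are placeholders):

public static void main(String[] args) {
    HiveBasicClient client = new HiveBasicClient(); // hypothetical class holding the snippets and the 'con' field
    client.con = client.getConnection("<HIVE_SERVER_IP>", "10000", "<USER>", "<PASSWORD>");

    if (client.con != null) {
        client.doQuery(); // runs the query and closes the connection
    } // if
} // main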

Page 40: FI-WARE Cosmos

4. Remote Hive client – Plague Tracker demo


https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/plague-tracker

Page 41: FI-WARE Cosmos

5. MapReduce applications

• MapReduce applications are commonly written in Java
  • They can be written in other languages through Hadoop Streaming
  • They are executed in the command line:
    $ hadoop jar <jar-file> <main-class> <input-dir> <output-dir>

• A MapReduce job consists of:
  • A driver, a piece of software where inputs, outputs, formats, etc. are defined, and the entry point for launching the job
  • A set of Mappers, given by a piece of software defining their behaviour
  • A set of Reducers, given by a piece of software defining their behaviour

• There are 2 APIs:
  • org.apache.hadoop.mapred (the old one)
  • org.apache.hadoop.mapreduce (the new one)

• Hadoop is distributed with MapReduce examples
  • [HADOOP_HOME]/hadoop-examples.jar

Page 42: FI-WARE Cosmos

5. MapReduce applications – Map

/* org.apache.hadoop.mapred example */
public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
        /* use the input value; the input key is the offset within the file
           and it is not necessary in this example */
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        /* iterate on the string, getting each word */
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            /* emit an output (key,value) pair based on the word and 1 */
            output.collect(word, one);
        } // while
    } // map
} // MapClass

Page 43: FI-WARE Cosmos

5. MapReduce applications – Reduce

/* org.apache.hadoop.mapred example */
public static class ReduceClass extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
        int sum = 0;

        /* iterate on all the values and add them */
        while (values.hasNext()) {
            sum += values.next().get();
        } // while

        /* emit an output (key,value) pair based on the word and its count */
        output.collect(key, new IntWritable(sum));
    } // reduce
} // ReduceClass

Page 44: FI-WARE Cosmos

5. MapReduce applications – Driver

/* org.apache.hadoop.mapred example */
package my.org;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(MapClass.class);
        conf.setCombinerClass(ReduceClass.class);
        conf.setReducerClass(ReduceClass.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    } // main
} // WordCount
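To compile and launch the example, a sequence along these lines should work (the jar and directory names are hypothetical; hadoop classpath prints the classpath needed to compile against the Hadoop libraries):

$ mkdir classes
$ javac -classpath $(hadoop classpath) -d classes WordCount.java
$ jar cf wordcount.jar -C classes .
$ hadoop jar wordcount.jar my.org.WordCount <input-dir> <output-dir>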

Page 45: FI-WARE Cosmos

6. Launching tasks with Oozie

• Oozie is a workflow scheduler system to manage Hadoop jobs:
  • Java map-reduce
  • Pig and Hive
  • Sqoop
  • System-specific jobs (such as Java programs and shell scripts)

• Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions.

• Writing Oozie applications is about including in a package (see the sketch below):
  • The MapReduce jobs, Hive/Pig scripts, etc. (executable code)
  • A Workflow
  • Parameters for the Workflow
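As an illustration, a workflow packaging the WordCount job from the previous slides might look like this minimal sketch (the schema version, names and property values are assumptions):

<workflow-app xmlns="uri:oozie:workflow:0.2" name="wordcount-wf">
  <start to="wordcount"/>
  <action name="wordcount">
    <map-reduce>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>WordCount failed</message>
  </kill>
  <end name="end"/>
</workflow-app>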

• Oozie can be used locally or remotely
  • https://oozie.apache.org/docs/4.0.0/index.html#Developer_Documentation

Page 46: FI-WARE Cosmos

6. Launching tasks with Oozie – Java client

OozieClient client = new OozieClient("http://130.206.80.46:11000/oozie/");

// create a workflow job configuration and set the workflow application path
Properties conf = client.createConfiguration();
conf.setProperty(OozieClient.APP_PATH, "hdfs://cosmosmaster-gi:8020/user/frb/mrjobs");
conf.setProperty("nameNode", "hdfs://cosmosmaster-gi:8020");
conf.setProperty("jobTracker", "cosmosmaster-gi:8021");
conf.setProperty("outputDir", "output");
conf.setProperty("inputDir", "input");
conf.setProperty("examplesRoot", "mrjobs");
conf.setProperty("queueName", "default");

// submit and start the workflow job
String jobId = client.run(conf);

// wait until the workflow job finishes, printing the status every 10 seconds
while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
    System.out.println("Workflow job running ...");
    Thread.sleep(10 * 1000);
} // while

System.out.println("Workflow job completed");

Page 47: FI-WARE Cosmos

Further reading

• The datasets are described at:
  • http://tinyurl.com/cosmos-datasets

• Hive remote basic client:
  • https://github.com/telefonicaid/fiware-connectors/tree/develop/resources/hive-basic-client

• Plague Tracker demo:
  • https://github.com/telefonicaid/fiware-livedemoapp/tree/master/cosmos/plague-tracker
  • http://130.206.81.65/plague-tracker/

• More detailed information can be found here:
  • http://tinyurl.com/cosmos-programmer-guide
  • http://tinyurl.com/cosmos-apis
  • http://tinyurl.com/cosmos-architecture

Page 49: FI-WARE Cosmos

http://fi-ppp.eu

http://fi-ware.eu

Follow @Fiware on Twitter!

Thanks!