15 minute presentation about Thesis

Transcript of 15 minute presentation about Thesis

Page 1: 15 minute presentation about Thesis

Too much Data!

Sven Meys

Page 2: 15 minute presentation about Thesis

On-demand Information Extraction from Remote Sensing Images with MapReduce

Topic

Page 3: 15 minute presentation about Thesis

Contents

• Context

• Literature study

• Planning

Page 4: 15 minute presentation about Thesis

Context

• VITO

• Remote Sensing

• Problem statement

• Research questions

Page 5: 15 minute presentation about Thesis

(VITO in figures) 700 · €103 million · 84% government / 16% private

Page 6: 15 minute presentation about Thesis

(VITO organisation chart) Units: Energy, Industrial Innovation, Energy Technology, Transition Energy & Environment, Environmental Analysis & Technology, Material Technology, Separation & Conversion Technology, Quality of Environment, Remote Sensing, Environmental Modelling, Environmental Health.

Page 7: 15 minute presentation about Thesis

Context

• VITO

• Remote Sensing

• Problem statement

• Research questions

Page 8: 15 minute presentation about Thesis


Page 9: 15 minute presentation about Thesis


Page 10: 15 minute presentation about Thesis

Remote Sensing


Page 11: 15 minute presentation about Thesis

1 km² per pixel · 0.5 billion pixels · 1.2 GB (at 1 km² per pixel, covering the Earth's surface of roughly 510 million km² takes about 0.5 billion pixels)

Page 12: 15 minute presentation about Thesis

RS Applications

Page 13: 15 minute presentation about Thesis

(Mock-up of an on-demand request form)

Time Series: NDVI, 01-01-2001 to 01-01-2012

Algorithm: Mean

Output: …

[SUBMIT]

Page 14: 15 minute presentation about Thesis

Context

• VITO

• Remote Sensing

• Problem statement

• Research questions

Page 15: 15 minute presentation about Thesis

Problem Statement

Better sensors → better images → more data → more expensive storage and more data transport

More information → more computation → expensive supercomputers → parallel processing

Page 16: 15 minute presentation about Thesis

Objectives

• Fast enough

• Affordable

• Scalable

→ File system + software framework

Page 17: 15 minute presentation about Thesis

Research Questions

• How can large satellite images be stored in an HDFS file system so that they can be processed efficiently in parallel?

• Which algorithms can be used with this storage technique and MapReduce?

Page 18: 15 minute presentation about Thesis

Contents

• Context

• Literature study

• Planning

Page 19: 15 minute presentation about Thesis

Literature Study

• Interesting projects

• HDFS

• MapReduce

• Implementations

• Distributions

• Current literature

Page 20: 15 minute presentation about Thesis

Interesting Projects

• NASA Center for Climate Simulation (12)

• Square Kilometer Array: 700 TB/sec

• Open Cloud Consortium (13): Project Matsu, Elastic Clouds for Disaster Relief

• Large Hadron Collider (14): 20 PB/year

Page 21: 15 minute presentation about Thesis

HDFS

• Distributed file system

• Based on the Google File System (1)

• Large blocks (128 MiB)

• Commodity hardware

• Failure is the norm

• Read & append

(Diagram: a file divided into blocks 1, 2, …, n)
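A minimal sketch, not from the slides, of how those blocks surface through the HDFS Java API; the file path is a made-up example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            // Connect to the default file system configured in core-site.xml.
            FileSystem fs = FileSystem.get(new Configuration());
            Path image = new Path("/data/image.tif"); // hypothetical satellite image
            FileStatus status = fs.getFileStatus(image);
            // One BlockLocation per block, listing the hosts that hold replicas.
            for (BlockLocation b : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("offset " + b.getOffset() + " on "
                        + java.util.Arrays.toString(b.getHosts()));
            }
        }
    }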

Page 22: 15 minute presentation about Thesis

HDFS

(Excerpt from the Calvalus Final Report, Brockmann Consult GmbH)

3 Technical Approach

3.1 Hadoop Distributed Computing

The basis of the Calvalus processing system is Apache Hadoop. Hadoop is industry-proven open-source software capable of running clusters of tens to tens of thousands of computers and processing ultra-large amounts of data based on massive parallelisation and a distributed file system.

3.1.1 Distributed File System (DFS)

In contrast to a local file system, the Network File System (NFS) or the Common Internet File System (CIFS), a distributed file system (DFS) uses multiple nodes in a cluster to store files and data resources [RD-5]. A DFS usually provides transparent file replication and fault tolerance, and it enables data locality for processing tasks. It does this by subdividing files into blocks and replicating these blocks within a cluster of computers. Figure 2 shows the distribution and replication (right) of a file (left) subdivided into three blocks.

Figure 2: File blocks, distribution and replication in a distributed file system

Figure 3 demonstrates how the file system handles node failure by automated recovery of under-replicated blocks. HDFS further uses checksums to verify block integrity. As long as there is at least one intact and accessible copy of a block, the system can automatically re-replicate it to return to the requested replication rate.

Figure 3: Automatic repair in case of cluster node failure by additional replication

Figure 4 shows how a distributed file system re-assembles blocks to retrieve the complete file for external retrieval.


Page 23: 15 minute presentation about Thesis

HDFS

(Excerpt from the Calvalus Final Report, Brockmann Consult GmbH, continued)

Figure 4: Block assembly for data retrieval from the distributed file system

3.1.2 Data Locality

Data processing systems that need to read and write large amounts of data perform best if the data I/O takes place on local storage devices. In clusters where storage nodes are separated from compute nodes, two situations are likely:

1. Network bandwidth is the bottleneck, especially when multiple tasks work in parallel on the same input data from different compute nodes.

2. Transfer rates of the local hard drives are the bottleneck, especially when multiple tasks work in parallel on single (multi-CPU, multi-core) compute nodes.

The solution to these problems is, first, to use a cluster whose nodes are both compute and storage nodes, and second, to distribute the processing tasks and execute them on the nodes that are "close" to the data with respect to the network topology (see Figure 5). Parallel processing of inputs is done on splits. A split is a logical part of an input file that usually has the size of the blocks that store the data; but in contrast to a block, which ends at an arbitrary byte position, a split is always aligned at file-format-specific record boundaries (see next chapter, step 1). Since splits are roughly aligned with file blocks, processing of input splits can be performed data-locally.

Figure 5: Data-local processing and result assembly for retrieval

3.1.3 MapReduce Programming Model

The MapReduce programming model was published in 2004 by the Google scientists J. Dean and S. Ghemawat [RD-4]. It is used for processing and generating huge datasets on clusters for certain kinds of distributable problems. The model is composed of a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world problems can be expressed in terms of this model, and programs written in this functional style are easily parallelised.
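In the notation of the original paper, the two functions have these types:

    map:    (k1, v1)        → list(k2, v2)
    reduce: (k2, list(v2))  → list(v2)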


Page 24: 15 minute presentation about Thesis

HDFS


Page 25: 15 minute presentation about Thesis

HDFS - Overview

• Scalable

• Fast reads/writes

• Robust

• A factor of 10 cheaper (2)

Page 26: 15 minute presentation about Thesis

MapReduce


Page 27: 15 minute presentation about Thesis

MapReduce - WordCount
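The slide showed the canonical WordCount example as an image. For reference, here is the standard Hadoop (new-API) version of it, reconstructed from the well-known example rather than taken from the slide:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every token in the input line.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }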


Page 28: 15 minute presentation about Thesis

MapReduce - Overview

• Based on Google MapReduce (3)

• Data locality

• Key/value pairs

• Very fast

• A different way of thinking

Page 29: 15 minute presentation about Thesis

Implementations

• Apache Software Foundation

• Others: outdated, commercial, little support (4-6)

                 Hadoop   Stratosphere   HPCC
    Support        +           -           +
    Extensions     +           -           ?
    Community     +++         +/-          -
    Target        ANY         EDU          BI

Page 30: 15 minute presentation about Thesis

Distributions

• Hortonworks (7)

• Cloudera: Cloudera Manager (9)

  • Web interface

  • 1-click install (yeah right...)

• MapR: interesting licence model (8)

Page 31: 15 minute presentation about Thesis

General

• Mostly text processing

• For small images (10)

• Little detail

• Commercial (11)

Page 32: 15 minute presentation about Thesis

Contents

• Context

• Literature study

• Planning

Page 33: 15 minute presentation about Thesis

Planning

(Timeline) literature, phase 1, phase 2, phase 3, phase 4, internship; markers: today, report, master's thesis submission; dates shown: 01/02, 15/03, 20/05, 01/09

Page 34: 15 minute presentation about Thesis

Phase 1 - Done

(Cluster diagram) Five RedHat 6.2 machines (workstations plus a virtual machine) at 192.168.10.245-249: a Master running the Name Node (NN) and Job Tracker (JT), and the workstations Sven, Patrick, Bruno and Tim running Data Nodes (DN) and Task Trackers (TT).

Legend: JT = Job Tracker, TT = Task Tracker, NN = Name Node, DN = Data Node
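A hedged sketch of the client-side settings such a Hadoop 1.x cluster needs; which machine hosts the Name Node and Job Tracker, and the ports, are assumptions rather than facts from the diagram:

    import org.apache.hadoop.conf.Configuration;

    public class ClusterConf {
        public static Configuration phase1() {
            Configuration conf = new Configuration();
            // Name Node, assumed to run on the Master machine (port 9000 is a common choice).
            conf.set("fs.default.name", "hdfs://192.168.10.245:9000");
            // Job Tracker, assumed on the same host (port 9001 is conventional in Hadoop 1.x).
            conf.set("mapred.job.tracker", "192.168.10.245:9001");
            // HDFS keeps three replicas of each block by default.
            conf.set("dfs.replication", "3");
            return conf;
        }
    }

In practice these properties live in core-site.xml, mapred-site.xml and hdfs-site.xml on every node.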

Page 35: 15 minute presentation about Thesis

Phase 2

• Simple algorithm

• Rotate an image (see the sketch below)

• Standard I/O

• HDFS
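A minimal sketch of this phase, assuming hypothetical paths and a small PNG that fits in memory; it reads from HDFS, rotates with plain Java imaging, and writes the result back:

    import java.awt.image.BufferedImage;
    import javax.imageio.ImageIO;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RotateImage {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Read the image straight out of HDFS; the paths are made up.
            try (FSDataInputStream in = fs.open(new Path("/in/tile.png"))) {
                BufferedImage src = ImageIO.read(in);
                int w = src.getWidth(), h = src.getHeight();
                BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_RGB);
                // Rotate 90 degrees clockwise: pixel (x, y) moves to (h-1-y, x).
                for (int y = 0; y < h; y++)
                    for (int x = 0; x < w; x++)
                        dst.setRGB(h - 1 - y, x, src.getRGB(x, y));
                try (FSDataOutputStream out = fs.create(new Path("/out/tile.png"))) {
                    ImageIO.write(dst, "png", out);
                }
            }
        }
    }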

Page 36: 15 minute presentation about Thesis

Phase 3

• More complexity: MapReduce

• Spatial: convolution mask, ROI

• Temporal/spectral: multiple images (see the sketch below)
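A hedged illustration of the temporal case: if a mapper emitted (pixel index, value) pairs for every image in the time series, a reducer of this shape would compute the per-pixel mean. The key and value types are assumptions, not the thesis design:

    import java.io.IOException;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    // Key = pixel index, values = that pixel's value in each image of the series.
    public class PixelMeanReducer
            extends Reducer<LongWritable, DoubleWritable, LongWritable, DoubleWritable> {
        @Override
        protected void reduce(LongWritable pixel, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0;
            long n = 0;
            for (DoubleWritable v : values) {
                sum += v.get();
                n++;
            }
            // Emit the temporal mean for this pixel.
            ctx.write(pixel, new DoubleWritable(sum / n));
        }
    }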

Page 37: 15 minute presentation about Thesis

Phase 4

• Performance as a function of pixel distance

Page 38: 15 minute presentation about Thesis

Planning

(Timeline, repeated from Page 33) literature, phase 1, phase 2, phase 3, phase 4, internship; markers: today, report, master's thesis submission; dates shown: 01/02, 15/03, 20/05, 01/09

Page 39: 15 minute presentation about Thesis

The End

• Lots of data

• A different way of thinking

• Lots of possibilities

  • RLZ or a new Big Data elective? ;)

  • MapReduce + OpenCL?

• Lots of challenges

• Lots of questions

Page 40: 15 minute presentation about Thesis

References

(1) Ghemawat, S., Gobioff, H. and Leung, S.-T. (2003), 'The Google file system'.

(2) Krishnan, S., Baru, C. and Crosby, C. (2010), 'Evaluation of MapReduce for gridding LIDAR data'.

(3) Dean, J. and Ghemawat, S. (2004), 'MapReduce: simplified data processing on large clusters'.

(4) http://hadoop.apache.org/

(5) Warneke, D. and Kao, O. (2009), 'Nephele: efficient parallel data processing in the cloud', http://www.stratosphere.eu

(6) http://hpccsystems.com/

(7) http://hortonworks.com/

(8) http://mapr.com/

(9) http://cloudera.com/

(10) Sweeney, C. (2011), 'HIPI: Hadoop image processing interface for image-based MapReduce'.

(11) Guinan, O. (2011), 'Indexing the Earth: large-scale satellite image processing using Hadoop', http://www.cloudera.com/content/cloudera/en/resources/library/hadoopworld/hadoop-world-2011-presentation-video-indexing-the-earth-large-scale-satellite-image-processing-using-hadoop.html

(12) Duffy, D. Q. (2013), 'Untangling the computing landscape for NASA climate simulations', http://www.nas.nasa.gov/SC12/demos/demo20.html

(13) http://www.slideshare.net/rgrossman/project-matsu-elastic-clouds-for-disaster-relief

(14) Lassnig, M., Garonne, V., Dimitrov, G. and Canali, L. (2012), 'ATLAS data management accounting with Hadoop Pig and HBase'.