TOP Updated Latest List of Hadoop Big Data Projects 2016 - 2017



Cloud Computing leverages Hadoop framework for processing BigData in parallel. Hadoop has certain

limitations that could be exploited to execute the job efficiently. These limitations are mostly because

of data locality in the cluster, jobs and tasks scheduling, and resource allocations in Hadoop. Efficient

resource allocation remains a challenge in Cloud Computing MapReduce platforms. We propose

H2Hadoop, which is an enhanced Hadoop architecture that reduces the computation cost associated

with BigData analysis. The proposed architecture also addresses the issue of resource allocation in

native Hadoop. H2Hadoop provides a better solution for "text data", such as finding DNA sequences and their motifs. H2Hadoop also provides an efficient Data Mining approach for Cloud Computing environments. The H2Hadoop architecture leverages the NameNode's ability to assign jobs to the TaskTrackers (DataNodes) within the cluster. By adding control features to the NameNode, H2Hadoop can intelligently direct and assign tasks to the DataNodes that contain the required data without sending the job to the whole cluster. Compared with native Hadoop, H2Hadoop reduces CPU time, the number of read operations, and other Hadoop metrics.

ETPL

BD -001 H2Hadoop: Improving Hadoop Performance using the Metadata of

Related Jobs
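The core idea can be illustrated with a small sketch. Everything below is our own illustration, not the paper's code: the class name, method names, and the "feature searched → matching DataNodes" table are all assumptions about how an enhanced NameNode could reuse metadata from previously completed, related jobs.

```python
# Hypothetical sketch of H2Hadoop's idea: the enhanced NameNode keeps a
# table of metadata from completed, related jobs, so a new job searching
# for the same feature (e.g., a DNA motif) is dispatched only to the
# DataNodes whose blocks are already known to contain the required data.

class EnhancedNameNode:
    def __init__(self, all_datanodes):
        self.all_datanodes = set(all_datanodes)
        # common-job table: feature searched -> DataNodes whose blocks matched
        self.related_jobs = {}

    def assign_job(self, feature):
        """Return the set of DataNodes the job should be sent to."""
        if feature in self.related_jobs:          # a related job ran before
            return self.related_jobs[feature]     # target only matching nodes
        return self.all_datanodes                 # native Hadoop: whole cluster

    def record_result(self, feature, matching_datanodes):
        self.related_jobs[feature] = set(matching_datanodes)

nn = EnhancedNameNode(["dn1", "dn2", "dn3", "dn4"])
assert nn.assign_job("ACGTAC") == {"dn1", "dn2", "dn3", "dn4"}  # first run
nn.record_result("ACGTAC", ["dn2", "dn3"])
assert nn.assign_job("ACGTAC") == {"dn2", "dn3"}                # related job
```

The second call touches only the two DataNodes that previously matched, which is where the reduction in CPU time and read operations would come from.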

Hadoop has two components namely HDFS and MapReduce. Hadoop stores user data based on space

utilization of datanodes on the cluster rather than the processing capability of the datanodes.

Furthermore, Hadoop often runs in a heterogeneous environment, as the datanodes may not all be homogeneous.

For these reasons, workload imbalances will occur when jobs run in a Hadoop cluster resulting in poor

performance. In this paper, we propose a dynamic algorithm to balance the workload between different

racks on a Hadoop cluster based on the log files of Hadoop. However, if the tasks are executing on

critical or sensitive data in a secured rack, the data transfer to an unsecured cluster will result in privacy

being compromised. We propose an approach to transfer data between racks without disclosing private

information. Moving tasks from the most overloaded rack to another rack improves the performance

of MapReduce jobs. Our simulations indicate that the proposed algorithm decreases the running time of a job on the most overloaded rack by more than 50%.

ETPL

BD - 002 Privacy Preserving Rack-Based Dynamic Workload Balancing for

Hadoop Map Reduce

Big data analysis has gained significant popularity within the last few years. The MapReduce

framework, presented by Google, makes it easier to write applications that process vast amounts of data. MapReduce targets large commodity clusters where failures are not exceptions. However, Hadoop, the most popular implementation of MapReduce, performs poorly under failures. Hadoop implements its fault tolerance strategy at the task level; as a result, a task failure requires a re-execution of the

whole task regardless of how much input has already been processed. In this paper, we present BeTL

which introduces slight changes to the execution flow of MapReduce, and makes it possible to gain a

finer-grained fault tolerance. Map tasks can create checkpoints so that a retrying task doesn't have to

start from scratch, saving much time. The speculation strategy can also benefit from this. The new execution flow involves fewer I/O operations and performs better than Hadoop even in the absence of failures. In

our experiments, BeTL outperforms Hadoop by 6.6 percent on average under no failures and 4.6 to

51.0 percent under different failure densities.

ETPL

BD -003 BeTL: Map Reduce Checkpoint Tactics Beneath the Task Level
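The sub-task-level checkpointing idea can be sketched in a few lines. This is our illustration of the general mechanism, not BeTL's implementation; the function names and the dict-based "persisted" checkpoint are assumptions.

```python
# A minimal sketch of sub-task checkpointing: a map task periodically
# records how many input records it has processed, so a retrying attempt
# resumes from the last checkpoint instead of re-executing the whole task.

def run_map_task(records, checkpoint, interval=100, fail_after=None):
    """checkpoint: a dict that survives the failed attempt (e.g., in HDFS)."""
    start = checkpoint.get("offset", 0)       # resume point (0 on first run)
    out = checkpoint.setdefault("output", [])
    del out[start:]                           # drop un-checkpointed output
    for i in range(start, len(records)):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated task failure")
        out.append(records[i].upper())        # stand-in for the map function
        if (i + 1) % interval == 0:
            checkpoint["offset"] = i + 1      # persist progress
    checkpoint["offset"] = len(records)
    return out

ckpt = {}
try:
    run_map_task(["a"] * 250, ckpt, fail_after=230)   # first attempt crashes
except RuntimeError:
    pass
assert ckpt["offset"] == 200          # progress up to the last checkpoint survives
out = run_map_task(["a"] * 250, ckpt) # retry resumes at record 200, not 0
assert len(out) == 250
```

The retry re-processes only the 50 records after the last checkpoint, which is the source of the time savings under failures.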

Hadoop is a standard implementation of MapReduce framework for running data-intensive

applications on clusters of commodity servers. By thoroughly studying the framework, we find that the shuffle phase, the all-to-all input-data fetching phase of the reduce task, significantly affects application performance. In Hadoop's MapReduce system there is variance in both the intermediate keys' frequencies and their distribution among data nodes throughout the cluster. This variance causes network overhead, which leads to unfairness in reduce input among different data nodes in the cluster. As a result, applications experience performance degradation during the shuffle phase. We develop a novel algorithm; unlike previous systems, our algorithm uses a node's capabilities as heuristics to decide a better available trade-off between locality and fairness in the system. Compared with Hadoop's default partitioning algorithm and the Leen algorithm, our approach achieves average performance gains of 29% and 17%, respectively.

ETPL

BD - 004 An efficient key partitioning scheme for heterogeneous Map Reduce

clusters
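A capability-aware partitioner of this flavor can be sketched as a greedy assignment. This is our simplification, not the paper's algorithm: the frequency table, the capability scores, and the greedy rule are all assumptions used for illustration.

```python
# A hedged sketch of capability-aware key partitioning: intermediate keys
# with known frequencies are assigned greedily to the reducer node whose
# load, relative to its capability, would stay smallest, so faster nodes
# receive proportionally more reduce input.

def partition_keys(key_freqs, node_capabilities):
    """key_freqs: {key: frequency}; node_capabilities: {node: relative speed}."""
    load = {n: 0 for n in node_capabilities}
    assignment = {}
    # place heavy keys first -- the classic greedy balancing order
    for key, freq in sorted(key_freqs.items(), key=lambda kv: -kv[1]):
        node = min(load, key=lambda n: (load[n] + freq) / node_capabilities[n])
        assignment[key] = node
        load[node] += freq
    return assignment, load

keys = {"k1": 90, "k2": 60, "k3": 30, "k4": 20}
caps = {"fast": 2.0, "slow": 1.0}
assignment, load = partition_keys(keys, caps)
# the fast node ends up with a larger share of the reduce input
assert load["fast"] > load["slow"]
```

A plain hash partitioner ignores both the key frequencies and the node speeds, which is exactly the variance problem the abstract describes.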

It is cost-efficient for a tenant with a limited budget to establish a virtual MapReduce cluster by renting

multiple virtual private servers (VPSs) from a VPS provider. To provide an appropriate scheduling

scheme for this type of computing environment, we propose in this paper a hybrid job-driven

scheduling scheme (JoSS for short) from a tenant's perspective. JoSS provides not only job-level

scheduling, but also map-task level scheduling and reduce-task level scheduling. JoSS classifies

MapReduce jobs based on job scale and job type and designs an appropriate scheduling policy to

schedule each class of jobs. The goal is to improve data locality for both map tasks and reduce tasks,

avoid job starvation, and improve job execution performance. Two variations of JoSS are further introduced to separately achieve better map-data locality and faster task assignment. We conduct

extensive experiments to evaluate and compare the two variations with current scheduling algorithms

supported by Hadoop. The results show that the two variations outperform the other tested algorithms

in terms of map-data locality, reduce-data locality, and network overhead without incurring significant

overhead. In addition, the two variations are separately suitable for different MapReduce-workload

scenarios and provide the best job performance among all tested algorithms.

ETPL

BD -005 Hybrid Job-Driven Scheduling for Virtual Map Reduce Clusters

We present concepts which can be used for the efficient implementation of Attribute Based Access

Control (ABAC) in large applications that may use several data storage technologies, including Hadoop, NoSQL, and relational database systems. The ABAC authorization process takes place in two main stages. First, a sequence of permissions is derived which specifies the permitted data to be retrieved

for the user's transaction. Secondly, query modification is used to augment the user's transaction with

code which implements the ABAC controls. This requires the storage technologies to support a high-level language such as SQL or similar. The modified user transactions are then optimized and processed

using the full functionality of the underlying storage systems. We use an extended ABAC model

(TCM2) which handles negative permissions and overrides in a single permissions processing

mechanism. We illustrate these concepts using a compelling electronic health records scenario.

ETPL

BD - 006 Attribute Based Access Control for Big Data Applications by Query

Modification
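The query-modification stage can be sketched simply. This is an illustration of the general technique, not the paper's TCM2 implementation; the function name, the permission format, and the sample table/columns are assumptions, and negative permissions (overrides) are omitted for brevity.

```python
# Illustrative ABAC query modification: derived permissions are turned into
# a predicate that is AND-ed onto the user's query, so the underlying
# SQL-like store enforces the access controls when it optimizes and runs it.

def modify_query(user_query, permissions):
    """permissions: list of (column, allowed_values) pairs."""
    predicates = []
    for column, allowed in permissions:
        values = ", ".join(f"'{v}'" for v in allowed)
        predicates.append(f"{column} IN ({values})")
    guard = " AND ".join(predicates)
    joiner = " AND " if " where " in user_query.lower() else " WHERE "
    return user_query + joiner + guard

q = modify_query(
    "SELECT * FROM health_records WHERE ward = 'A'",
    [("department", ["cardiology"]), ("sensitivity", ["low", "medium"])],
)
assert "ward = 'A' AND department IN ('cardiology')" in q
```

Because the guard is ordinary SQL, the store's own optimizer handles it with the full functionality of the underlying system, as the abstract describes.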

DSePHR is proposed in this work in order to manage the encrypted PHR data on a cloud storage.

HBase and Hadoop are utilized in this work. The objective is to provide an API for any PHR system

to upload/download the encrypted PHR data from a cloud storage. The DSePHR resolves the

NameNode memory issues of HDFS when storing many small files by classifying the encrypted PHR data into small and large files. The small files are handled by an HBase schema proposed in this work. The memory consumption and processing time of the proposed DSePHR are evaluated

using real data sets collected from various healthcare communities.

ETPL

BD -007 Distributed storage design for encrypted personal health record data

Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic

parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to

this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the

MapReduce programming model. To achieve compressed storage and avoid building conditional

pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional FP

trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial

third MapReduce job, the mappers independently decompose itemsets, the reducers perform combination operations by constructing small ultrametric trees, and the actual mining of these trees is performed separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster

is sensitive to data distribution and dimensions, because itemsets with different lengths have different

decomposition and construction costs. To improve FiDoop's performance, we develop a workload

balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD,

an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis.

Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution

is efficient and scalable.

ETPL

BD - 008 FiDoop: Parallel Mining of Frequent Item sets Using Map Reduce

In big-data-driven traffic flow prediction systems, the robustness of prediction performance depends

on accuracy and timeliness. This paper presents a new MapReduce-based nearest neighbor (NN)

approach for traffic flow prediction using correlation analysis (TFPC) on a Hadoop platform. In

particular, we develop a real-time prediction system including two key modules, i.e., offline distributed

training (ODT) and online parallel prediction (OPP). Moreover, we build a parallel k-nearest neighbor

optimization classifier, which incorporates correlation information among traffic flows into the

classification process. Finally, we propose a novel prediction calculation method, combining the

current data observed in OPP and the classification results obtained from large-scale historical data in

ODT, to generate traffic flow prediction in real time. The empirical study on real-world traffic flow

big data using the leave-one-out cross validation method shows that TFPC significantly outperforms

four state-of-the-art prediction approaches, i.e., autoregressive integrated moving average, Naïve

Bayes, multilayer perceptron neural networks, and NN regression, in terms of accuracy, which is improved by up to 90.07% in the best case, with an average mean absolute percent error of 5.53%. In addition,

it displays excellent speedup, scaleup, and sizeup.

ETPL

BD - 009 A Map Reduce-Based Nearest Neighbour Approach for Big-Data-Driven

Traffic Flow Prediction

As the adoption of Cloud Computing grows exponentially, a huge amount of data is generated that needs to be processed in order to efficiently control what is going on within the infrastructure, and also to respond effectively and promptly to security threats. Herein, we provide a highly scalable, plugin-based, comprehensive solution for real-time monitoring that reduces the impact of an attack or a particular issue on the overall distributed infrastructure. This work covers a broader scope of infrastructure security by monitoring all devices that generate log files or network traffic. By applying different Big Data techniques for data analysis, we can ensure a responsive solution to any problem (security or other) within the infrastructure and act accordingly.

ETPL

BD - 010 Toward a cloud-based security intelligence with big data processing

In this paper, we study the problems of scalability and performance for similarity search by proposing

eHSim, an efficient hybrid similarity search with MapReduce. More specifically, we introduce

clustering schemes that partition objects into different groups by their length. Additionally, we equip

our proposed schemes with pruning strategies that quickly discard irrelevant objects before truly

computing their similarity. Moreover, we design a hybrid MapReduce architecture that deals with

challenges from big data. Furthermore, we implement our proposed methods with MapReduce and

make them compatible with the hybrid MapReduce architecture. Last but not least, we evaluate the

proposed methods with real datasets. Empirical experiments show that our approach is considerably

more efficient than the state of the art in terms of query processing, batch processing, and data storage.

ETPL

BD - 011 eHSim: An Efficient Hybrid Similarity Search with Map Reduce
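The length-based clustering and pruning idea can be sketched for Jaccard set similarity. This is our simplified illustration, not eHSim's actual scheme: the grouping-by-size index and the standard length-filter bounds are assumptions standing in for the paper's clustering and pruning strategies.

```python
# A hedged sketch of length-based grouping plus pruning for Jaccard
# similarity: objects are grouped by size, and for a query only groups
# whose size can still reach the threshold are examined, so irrelevant
# objects are discarded before any similarity is actually computed.

from collections import defaultdict
from math import ceil, floor

def jaccard(a, b):
    return len(a & b) / len(a | b)

def build_index(objects):
    groups = defaultdict(list)          # length -> objects of that length
    for obj in objects:
        groups[len(obj)].append(obj)
    return groups

def search(groups, query, threshold):
    lo = ceil(threshold * len(query))   # |y| >= t*|x| is necessary
    hi = floor(len(query) / threshold)  # |y| <= |x|/t is necessary
    hits = []
    for length in range(lo, hi + 1):    # every other group is pruned
        for obj in groups.get(length, []):
            if jaccard(query, obj) >= threshold:
                hits.append(obj)
    return hits

objs = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 3, 4, 5, 6, 7, 8}, {9}]
idx = build_index(objs)
res = search(idx, {1, 2, 3, 4}, threshold=0.5)
assert {1, 2, 3} in res and {1, 2, 3, 4} in res and {9} not in res
```

In a MapReduce setting, each group could be processed by a separate task, which is how this style of pruning combines with the hybrid architecture described in the abstract.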

Traffic for a typical MapReduce job in a data center consists of multiple network flows. Traditionally,

network resources have been allocated to optimize network-level metrics such as flow completion time

or throughput. Some recent schemes propose using application-aware scheduling which can shorten

the average job completion time. However, most of them treat the core network as a black box with

sufficient capacity. Even if only one network link in the core network becomes a bottleneck, it can hurt

application performance. We design and implement a centralized flow-scheduling framework called

Phurti with the goal of improving the completion time for jobs in a cluster shared among multiple

Hadoop jobs (multi-tenant). Phurti communicates both with the Hadoop framework to retrieve job-

level network traffic information and the OpenFlow-based switches to learn about the network

topology. Phurti implements a novel heuristic called Smallest Maximum Sequential-traffic First

(SMSF) that uses collected application and network information to perform traffic scheduling for

MapReduce jobs. Our evaluation with real Hadoop workloads shows that compared to application and

network-agnostic scheduling strategies, Phurti improves job completion time for 95% of the jobs,

decreases average job completion time by 20% and tail job completion time by 13%, and scales well with

the cluster size and number of jobs.

ETPL

BD - 012 Phurti: Application and Network-Aware Flow Scheduling for Multi-

tenant Map Reduce Clusters

Erasure code based distributed storage systems are increasingly being used by storage providers for

big data storage, since they offer the same reliability as replication with a significant decrease in the amount of storage required. But when it comes to a storage system with data nodes spread across a

very large geographical area, the code's recovery performance is affected by various factors, both

network- and computation-related. In this paper, we propose an XOR-based code, supplemented with the ideas of parity duplication and rack awareness, that could be adopted in such storage clusters to improve

the recovery performance during node failures. We have implemented them on the erasure code module

of the XORBAS version of the Hadoop Distributed File System (HDFS). For evaluating the

performance of the proposed ideas, we employ a geo-diverse cluster on the NeCTAR research cloud.

The experimental results show that the techniques reduce the data read for repair by 85% and the repair duration by 57% during node failures, though resulting in an

increased storage requirement of 21% as compared to the traditional Reed-Solomon codes used in

HDFS. Together, these ideas could offer a better solution for a code-based storage system spanning a wide geographical area that has storage constraints making a triple-replicated system unaffordable, while also having strict requirements on minimal recovery time.

ETPL

BD - 013 On improving recovery performance in erasure code based geo-diverse

storage clusters
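The XOR-parity principle underlying such codes can be shown in a few lines. This is our minimal illustration, not the XORBAS/HDFS code itself: the parity block is the XOR of the data blocks, so one lost block is rebuilt from the survivors.

```python
# Minimal XOR-parity illustration: the parity block is the byte-wise XOR of
# the data blocks, so any single lost block can be rebuilt by XOR-ing the
# surviving blocks with the parity -- reading far less data than repairing
# a full Reed-Solomon group.

def xor_blocks(blocks):
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"aaaa", b"bbbb", b"cccc"]
parity = xor_blocks(data)            # stored alongside the data blocks

# suppose block 1 is lost: rebuild it from the survivors plus the parity
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == b"bbbb"
```

Rack awareness and parity duplication, as described above, then decide where such parity blocks are placed so that a repair reads mostly within a rack.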

Users of cloud storage usually assign different redundancy configurations of erasure codes, depending on the desired balance between performance and fault tolerance. Our study finds that a coding scheme chosen by rules of thumb for a given redundancy configuration performs best only with very low probability. In this paper, we propose CaCo, an efficient Cauchy coding approach for

data storage in the cloud. First, CaCo uses Cauchy matrix heuristics to produce a matrix set. Second,

for each matrix in this set, CaCo uses XOR schedule heuristics to generate a series of schedules.

Finally, CaCo selects the shortest one from all the produced schedules. In such a way, CaCo has the

ability to identify an optimal coding scheme, within the capability of the current state of the art, for an

arbitrary given redundancy configuration. By leverage of CaCo's nature of ease to parallelize, we boost

significantly the performance of the selection process with abundant computational resources in the

cloud. We implement CaCo in the Hadoop distributed file system and evaluate its performance by

comparing with “Hadoop-EC” developed by Microsoft research. Our experimental results indicate that

CaCo can obtain an optimal coding scheme within acceptable time. Furthermore, CaCo outperforms

Hadoop-EC by 26.68-40.18 percent in the encoding time and by 38.4-52.83 percent in the decoding

time simultaneously.

ETPL

BD - 014 CaCo: An Efficient Cauchy Coding Approach for Cloud Storage

Systems

Infrastructure-as-a-service clouds are becoming widely adopted. However, resource sharing and multi-

tenancy have made performance anomalies a top concern for users. Timely debugging those anomalies

is paramount for minimizing the performance penalty for users. Unfortunately, this debugging often

takes a long time due to the inherent complexity and sharing nature of cloud infrastructures. When an

application experiences a performance anomaly, it is important to distinguish between faults with a global impact and faults with a local impact, as the diagnosis and recovery steps for the two are quite different. In this paper, we present PerfCompass, an online

performance anomaly fault debugging tool that can quantify whether a production-run performance

anomaly has a global impact or local impact. PerfCompass can use this information to suggest the root

cause as either an external fault (e.g., environment-based) or an internal fault (e.g., software bugs).

Furthermore, PerfCompass can identify top affected system calls to provide useful diagnostic hints for

detailed performance debugging. PerfCompass does not require source code or runtime application

instrumentation, which makes it practical for production systems. We have tested PerfCompass by

running five common open source systems (e.g., Apache, MySQL, Tomcat, Hadoop, Cassandra) inside

a virtualized cloud testbed. Our experiments use a range of common infrastructure sharing issues and

real software bugs. The results show that PerfCompass accurately classifies 23 out of the 24 tested

cases without calibration and achieves 100 percent accuracy with calibration. PerfCompass provides

useful diagnosis hints within several minutes and imposes negligible runtime overhead to the

production system during normal execution time.

ETPL

BD - 015 PerfCompass: Online Performance Anomaly Fault Localization and

Inference in Infrastructure-as-a-Service Clouds

In recent years, big data have become a hot research topic. The increasing amount of big data also

increases the chance of breaching the privacy of individuals. Since big data require high computational

power and large storage, distributed systems are used. As multiple parties are involved in these systems,

the risk of privacy violation is increased. There have been a number of privacy-preserving mechanisms

developed for privacy protection at different stages (e.g., data generation, data storage, and data

processing) of a big data life cycle. The goal of this paper is to provide a comprehensive overview of

the privacy preservation mechanisms in big data and present the challenges for existing mechanisms.

In particular, in this paper, we illustrate the infrastructure of big data and the state-of-the-art privacy-

preserving mechanisms in each stage of the big data life cycle. Furthermore, we discuss the challenges

and future research directions related to privacy preservation in big data.

ETPL

BD - 016 Protection of Big Data Privacy

With the advancement of information and communication technology, data are being generated at an

exponential rate via various instruments and collected at an unprecedented scale. Such large volume

of data generated is referred to as big data, which now are revolutionizing all aspects of our life ranging

from enterprises to individuals, from science communities to governments, as they exhibit great

potentials to improve efficiency of enterprises and the quality of life. To obtain nontrivial patterns and

derive valuable information from big data, a fundamental problem is how to properly place the data collected by different users into distributed clouds and to efficiently analyze it to reduce user costs in data storage and processing, particularly for users who share data.

Doing so requires close collaboration among the users in sharing and utilizing the big data in distributed clouds, due to the complexity and volume of big data. Since computing, storage, and bandwidth resources in a distributed cloud are usually limited, and such resource provisioning is typically expensive, the collaborating users are required to use the resources fairly. In this paper, we study

a novel collaboration- and fairness-aware big data management problem in distributed cloud

environments that aims to maximize the system throughput, while minimizing the operational cost of

service providers to achieve the system throughput, subject to resource capacity and user fairness

constraints. We first propose a novel optimization framework for the problem. We then devise a fast

yet scalable approximation algorithm based on the built optimization framework. We also analyze the

time complexity and approximation ratio of the proposed algorithm. We finally conduct experiments

by simulations to evaluate the performance of the proposed algorithm. Experimental results

demonstrate that the proposed algorithm is promising, and outperforms other heuristics.

ETPL

BD - 017 Collaboration- and Fairness-Aware Big Data Management in Distributed

Clouds

Big data are widely recognized as being one of the most powerful drivers to promote productivity,

improve efficiency, and support innovation. It is highly expected to explore the power of big data and

turn big data into big values. To answer the interesting question whether there are inherent correlations

between the two tendencies of big data and green challenges, a recent study has investigated the issues

on greening the whole life cycle of big data systems. This paper aims to uncover the relations between the trend of the big data era and that of the new-generation green revolution through a

comprehensive and panoramic literature survey in big data technologies toward various green

objectives and a discussion on relevant challenges and future directions.

ETPL

BD - 018 Big Data Meet Green Challenges: Big Data toward Green Applications

To improve the efficiency of big data feature learning, the paper proposes a privacy preserving deep

computation model by offloading the expensive operations to the cloud. Privacy concerns are evident because a large amount of private data is produced by various applications in the smart city, such as sensitive government data or proprietary enterprise information. To protect the private data,

the proposed model uses the BGV encryption scheme to encrypt the private data and employs cloud

servers to perform the high-order back-propagation algorithm on the encrypted data efficiently for deep

computation model training. Furthermore, the proposed scheme approximates the sigmoid function as

a polynomial function to support the secure computation of the activation function with the BGV

encryption. In our scheme, only the encryption operations and the decryption operations are performed

by the client, while all the computation tasks are performed in the cloud. Experimental results show that our scheme improves training efficiency by approximately 2.5 times compared to the conventional deep computation model, without disclosing the private data, using a cloud of ten nodes. More importantly, our scheme is highly scalable by employing more cloud servers,

which is particularly suitable for big data.

ETPL

BD - 019 Privacy Preserving Deep Computation Model on Cloud for Big Data

Feature Learning
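The polynomial-approximation trick can be illustrated concretely. The exact polynomial the paper uses may differ; below we take the standard degree-3 Taylor expansion of the sigmoid around 0 as an assumed stand-in, since a scheme like BGV can evaluate additions and multiplications on ciphertexts but not a transcendental function directly.

```python
# Hedged sketch: replace the sigmoid with a low-degree polynomial,
# sigma(x) ~ 1/2 + x/4 - x^3/48 (Taylor around 0), so the activation can be
# computed homomorphically using only additions and multiplications.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_poly(x):
    # only +, -, * : the operations expressible over BGV-encrypted values
    return 0.5 + x / 4.0 - x ** 3 / 48.0

for x in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    assert abs(sigmoid(x) - sigmoid_poly(x)) < 0.01   # accurate near 0
```

The approximation is only good on a bounded range around 0, which is why such schemes typically normalize inputs before the activation.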

Big Data is a term that describes the large volume of data both structured and unstructured that is

difficult to process using traditional database and software techniques. Cloud computing is a

technology that offers a solution to this problem. We have designed a cloud-based framework for

unstructured data analysis that is motivated by the goal of efficient image data analysis. The framework

consists of two general stages of feature extraction and machine learning that can be used in training

mode and classification mode. The framework uses sampling and feedback mechanisms whereby the

system learns which image features are most important, and also learns which algorithm(s) is best

(under user criteria of accuracy and speed). This information allows the system to be more efficient by

automatically reducing the number of features captured, and number of algorithms being used to

evaluate the data. While there is an overhead to the auto-adjusting mechanisms, they do lead to a more

efficient solution for big data sets. The solution is evaluated using a leaf-image classification problem, and the results show improvements in efficiency.

ETPL

BD - 020 Efficient Cloud-Based Framework for Big Data Classification

The MapReduce programming model simplifies large-scale data processing on commodity clusters by

exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the

performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which

plays a critical role in performance enhancement. Traditionally, a hash function is used to partition

intermediate data among reduce tasks, which, however, is not traffic-efficient because network

topology and the data size associated with each key are not taken into consideration. In this paper, we study how to reduce the network traffic cost of a MapReduce job by designing a novel intermediate data partition

scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator

can reduce merged traffic from multiple map tasks. A decomposition-based distributed algorithm is

proposed to deal with the large-scale optimization problem for big data application and an online

algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally,

extensive simulation results demonstrate that our proposals can significantly reduce network traffic

cost under both offline and online cases.

ETPL

BD - 022 On Traffic-Aware Partition and Aggregation in Map Reduce for Big

Data Applications
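The contrast between oblivious hash partitioning and traffic-aware partitioning can be sketched numerically. This is our simplification, not the paper's decomposition-based algorithm: we assume the per-node data volume of each key is known and simply assign each key to the node that already holds most of its data.

```python
# Hedged sketch of traffic-aware partitioning: each key's intermediate data
# volume per map node is known, and the key is assigned to the node that
# already holds most of its data, so only the remainder crosses the network.

def traffic(assignment, key_volumes):
    """Bytes that must be shuffled across the network."""
    return sum(
        vol
        for key, per_node in key_volumes.items()
        for node, vol in per_node.items()
        if node != assignment[key]
    )

# intermediate data for each key, per map node (bytes)
key_volumes = {
    "k1": {"n1": 80, "n2": 10},
    "k2": {"n1": 5, "n2": 70},
}

hash_assign = {"k1": "n2", "k2": "n1"}   # what an oblivious hash might do
aware_assign = {k: max(v, key=v.get) for k, v in key_volumes.items()}

assert aware_assign == {"k1": "n1", "k2": "n2"}
assert traffic(aware_assign, key_volumes) < traffic(hash_assign, key_volumes)
```

Here the aware assignment ships 15 bytes instead of 150; aggregators, as described above, reduce this further by merging traffic from multiple map tasks before it leaves a rack.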

Big Data distribution has benefited from Cloud resources to accommodate applications' QoS requirements. In this paper, we propose a Big Data distribution scheme that matches the available Cloud resources to guarantee an application's QoS, given the continuously dynamic and varying resources of the Cloud infrastructure. We developed Two-Level QoS Policies (TLPS) for selecting clusters and nodes

while satisfying the client's application QoS. We also proposed an adaptive data distribution algorithm

to cope with changing QoS requirements. Experiments have been conducted to evaluate both the

effectiveness and the communication overhead of our proposed distribution scheme and the results we

have reported are convincing. Other experiments evaluated our TLPS algorithm against other single-QoS-based data distribution algorithms, and the results show that the TLPS algorithm adapts to the customer's QoS requirements.

ETPL

BD - 021 Policy-Based QoS Enforcement for Adaptive Big Data Distribution on

the Cloud

Controlling access to huge amounts of big data is a very challenging issue, especially when big data are stored in the cloud. Ciphertext-Policy Attribute-based Encryption (CP-ABE) is a promising encryption technique that enables end-users to encrypt their data under access policies

defined over some attributes of data consumers and only allows data consumers whose attributes satisfy

the access policies to decrypt the data. In CP-ABE, the access policy is attached to the ciphertext in

plaintext form, which may also leak some private information about end-users. Existing methods only

partially hide the attribute values in the access policies, while the attribute names are still unprotected.

In this paper, we propose an efficient and fine-grained big data access control scheme with privacy-

preserving policy. Specifically, we hide the whole attribute (rather than only its values) in the access

policies. To assist data decryption, we also design a novel Attribute Bloom Filter to evaluate whether an attribute is in the access policy and, if so, to locate its exact position in the policy. Security analysis and performance evaluation show that our scheme preserves privacy for any LSSS access policy without incurring much overhead.

ETPL

BD - 023 An Efficient and Fine-grained Big Data Access Control Scheme with

Privacy-preserving Policy
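The Attribute Bloom Filter idea can be illustrated with a standard Bloom filter. This is our illustration, not the paper's construction: to imitate position recovery we insert "attribute:row" strings, and the sizes, hash scheme, and example attributes are all assumptions.

```python
# Illustrative Attribute Bloom Filter: lets a decryptor test whether an
# attribute occurs in the hidden access policy, and at which matrix row,
# without the policy being attached to the ciphertext in plaintext.

import hashlib

class BloomFilter:
    def __init__(self, size=1024, hashes=4):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:4], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):   # no false negatives, rare false positives
        return all(self.bits[p] for p in self._positions(item))

# hide which attributes the LSSS policy uses, and at which matrix row
abf = BloomFilter()
for row, attr in enumerate(["doctor", "cardiology"]):
    abf.add(f"{attr}:{row}")

def locate(abf, attr, max_rows=8):
    return [r for r in range(max_rows) if abf.might_contain(f"{attr}:{r}")]

assert 0 in locate(abf, "doctor")    # attribute found, with its row position
```

Only hashes of the attributes reach the filter, so an observer of the ciphertext learns neither the attribute names nor their values.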

Big-data computing is a new critical challenge for the ICT industry. Engineers and researchers are

dealing with data sets of petabyte scale in the cloud computing paradigm. Thus, the demand for

building a service stack to distribute, manage, and process massive data sets has risen drastically. In

this paper, we investigate the Big Data Broadcasting problem for a single source node to broadcast a

big chunk of data to a set of nodes with the objective of minimizing the maximum completion time.

These nodes may locate in the same datacenter or across geo-distributed datacenters. This problem is

one of the fundamental problems in distributed computing and is known to be NP-hard in

heterogeneous environments. We model the Big-data broadcasting problem into a LockStep Broadcast

Tree (LSBT) problem. The main idea of the LSBT model is to define a basic unit of upload bandwidth, r, such that a node with capacity c broadcasts data to a set of ⌊c/r⌋ children at the rate r. Note that r is a

parameter to be optimized as part of the LSBT problem. We further divide the broadcast data into m

chunks. These data chunks can then be broadcast down the LSBT in a pipeline manner. In a

homogeneous network environment in which each node has the same upload capacity c, we show that

the optimal uplink rate r* of LSBT is either c/2 or c/3, whichever gives the smaller maximum

completion time. For heterogeneous environments, we present an O(n log² n) algorithm to select an optimal uplink rate r* and to construct an optimal LSBT. Numerical results show that our approach performs well, with lower maximum completion time and lower computational complexity than other efficient solutions in the literature.

ETPL

BD - 024 A Novel Pipeline Approach for Efficient Big Data Broadcasting
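The rate-selection result above (in a homogeneous network the optimal uplink rate is c/2 or c/3) can be sketched with a toy completion-time model. The tree-depth and pipelining formulas below are our simplifying assumptions for illustration, not the paper's exact analysis:

```python
import math

def lsbt_completion_time(n, c, r, data_size, m):
    # Fanout: a node with upload capacity c serves floor(c/r) children at rate r.
    d = int(c // r)
    if d < 1:
        return float("inf")
    # Tree depth for n nodes (a chain when d == 1); a rough log-based estimate.
    depth = (n - 1) if d == 1 else math.ceil(math.log(n, d))
    chunk_time = (data_size / m) / r        # time to push one chunk one hop
    # Pipelining: the last of m chunks reaches the deepest node after
    # (depth + m - 1) hop-times.
    return (depth + m - 1) * chunk_time

def best_rate(n, c, data_size, m):
    # The paper's homogeneous-network result: the optimum is c/2 or c/3,
    # whichever yields the smaller maximum completion time.
    return min((c / 2, c / 3),
               key=lambda r: lsbt_completion_time(n, c, r, data_size, m))
```

A larger fanout shortens the tree but halves or thirds each link's rate, which is why only these two candidate rates need to be compared in the homogeneous case.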

Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a

computationally intensive task with a wide range of applications, such as knowledge discovery or data

mining. However, as the volume and the dimension of data increase, only distributed approaches can

perform such costly operation in a reasonable time. Recent works have focused on implementing

efficient solutions using the MapReduce programming model because it is suitable for distributed large

scale data processing. Although these works provide different solutions to the same problem, each one

has particular constraints and properties. In this paper, we compare the different existing approaches

for computing kNN on MapReduce, first theoretically, and then by performing an extensive

experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN

computation on MapReduce: data pre-processing, data partitioning and computation. We then analyze

each step from load balancing, accuracy and complexity aspects. Experiments in this paper use a variety

of datasets, and analyze the impact of data volume, data dimension and the value of k from many

perspectives, such as time and space complexity and accuracy. The experiments reveal new

advantages and shortcomings, which are discussed for each algorithm. To the best of our knowledge, this

is the first paper that compares kNN computing methods on MapReduce both theoretically and

experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based

practical problems in the context of big data.

ETPL

BD - 025 K Nearest Neighbour Joins for Big Data on Map Reduce: a Theoretical

and Experimental Analysis
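The three generic steps identified above (data pre-processing, data partitioning, and computation) can be illustrated with a minimal single-process simulation of a MapReduce-style kNN join. The round-robin partitioning and the function names are ours, chosen for brevity rather than load balance:

```python
import heapq
import math

def knn_mapreduce(queries, data, k, n_parts=4):
    # Step 1: pre-processing (here, just materializing the input set S).
    S = list(data)
    # Step 2: partition S across "mappers" (round-robin here; real systems
    # use space- or distance-aware partitioning for load balancing).
    parts = [S[i::n_parts] for i in range(n_parts)]

    def dist(p, q):
        return math.dist(p, q)

    # Step 3a ("map"): local top-k nearest neighbours within each partition.
    candidates = {i: [] for i in range(len(queries))}
    for part in parts:
        for qi, q in enumerate(queries):
            candidates[qi].extend(
                heapq.nsmallest(k, part, key=lambda s: dist(q, s)))
    # Step 3b ("reduce"): merge per-partition candidates into a global top-k.
    return {qi: heapq.nsmallest(k, cands, key=lambda s: dist(queries[qi], s))
            for qi, cands in candidates.items()}
```

Because each partition contributes its local top-k, the merge step sees at most k * n_parts candidates per query, keeping the reduce phase small regardless of |S|.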

This paper introduces HadoopViz; a MapReduce-based framework for visualizing big spatial data.

HadoopViz has three unique features that distinguish it from other techniques. (1) It exposes an

extensible interface which allows users to define new visualization types, e.g., scatter plot, road

network, or heat map, by defining five abstract functions, without delving into the implementation

details of the MapReduce algorithms. As it is open source, HadoopViz allows algorithm designers to

focus on how the data should be visualized rather than performance or scalability issues. (2) HadoopViz

is capable of generating big images with giga-pixel resolution by employing a three-phase technique,

partition-plot-merge. (3) HadoopViz provides a smoothing functionality which can fuse nearby records

together as the image is plotted. This makes it capable of generating more types of images with high

quality as compared to existing work. Experimental results on real datasets of up to 14 Billion points

show the extensibility, scalability, and efficiency of HadoopViz to handle different visualization types

of spatial big data.

ETPL

BD - 026 HadoopViz: A Map Reduce framework for extensible visualization of

big spatial data
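The extensible five-function interface and the partition-plot-merge pipeline described above might look roughly like the following sketch. The function names and the toy scatter-count visualizer are illustrative, not HadoopViz's actual API:

```python
from abc import ABC, abstractmethod

class Visualizer(ABC):
    # Five abstract functions in the spirit of HadoopViz's extensible
    # interface (names here are our illustration, not the paper's API).
    @abstractmethod
    def smooth(self, records): ...            # fuse nearby records before plotting
    @abstractmethod
    def create_canvas(self, width, height): ...
    @abstractmethod
    def plot(self, canvas, record): ...
    @abstractmethod
    def merge(self, canvas_a, canvas_b): ...
    @abstractmethod
    def write(self, canvas, path): ...

def render(viz, partitions, width, height):
    # Three-phase partition-plot-merge: plot each data partition on its
    # own canvas, then merge the partial canvases into the final image.
    canvases = []
    for part in partitions:
        canvas = viz.create_canvas(width, height)
        for rec in viz.smooth(part):
            viz.plot(canvas, rec)
        canvases.append(canvas)
    final = canvases[0]
    for c in canvases[1:]:
        final = viz.merge(final, c)
    return final

class ScatterCounts(Visualizer):
    # Toy concrete visualizer: the canvas is a grid of point counts.
    def smooth(self, records):
        return records                        # no fusing in this toy example
    def create_canvas(self, width, height):
        return [[0] * width for _ in range(height)]
    def plot(self, canvas, record):
        x, y = record
        canvas[y][x] += 1
    def merge(self, canvas_a, canvas_b):
        return [[a + b for a, b in zip(ra, rb)]
                for ra, rb in zip(canvas_a, canvas_b)]
    def write(self, canvas, path):
        pass                                  # a real visualizer would emit an image
```

The driver never inspects the data itself, which is what lets algorithm designers plug in new visualization types without touching the MapReduce machinery.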

The global deployment of cloud datacenters is enabling large scale scientific workflows to improve

performance and deliver fast responses. This unprecedented geographical distribution of the

computation is doubled by an increase in the scale of the data handled by such applications, bringing

new challenges related to the efficient data management across sites. High throughput, low latencies

or cost-related trade-offs are just a few concerns for both cloud providers and users when it comes to

handling data across datacenters. Existing solutions are limited to cloud-provided storage, which offers

low performance based on rigid cost schemes. In turn, workflow engines need to improvise substitutes,

achieving performance at the cost of complex system configurations, maintenance overheads, reduced

reliability and reusability. In this paper, we introduce OverFlow, a uniform data management system

for scientific workflows running across geographically distributed sites, aiming to reap economic

benefits from this geo-diversity. Our solution is environment-aware, as it monitors and models the

global cloud infrastructure, offering high and predictable data handling performance for transfer cost

and time, within and across sites. OverFlow proposes a set of pluggable services, grouped in a data

scientist cloud kit. They provide the applications with the possibility to monitor the underlying

infrastructure, to exploit smart data compression, deduplication and geo-replication, to evaluate data

management costs, to set a trade-off between money and time, and to optimize the transfer strategy

accordingly. The system was validated on the Microsoft Azure cloud across its 6 EU and US

datacenters. The experiments were conducted on hundreds of nodes using synthetic benchmarks and

real-life bio-informatics applications (A-Brain, BLAST). The results show that our system is able to

model accurately the cloud performance and to leverage this for efficient data dissemination, being

able to reduce the monetary costs and transfer time by up to three times.

ETPL

BD - 027 OverFlow: Multi-Site Aware Big Data Management for Scientific

Workflows on Clouds
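The money/time trade-off that OverFlow exposes can be illustrated with a toy strategy selector. The normalization and weighting scheme here is our assumption for illustration, not the system's actual cost model:

```python
def pick_strategy(strategies, money_weight=0.5):
    # Each strategy is (name, cost_usd, time_s). Normalize both axes to
    # [0, 1] and pick the strategy minimizing the weighted sum, so the
    # user can slide between "cheapest" and "fastest".
    max_cost = max(s[1] for s in strategies) or 1.0
    max_time = max(s[2] for s in strategies) or 1.0

    def score(s):
        return (money_weight * (s[1] / max_cost)
                + (1 - money_weight) * (s[2] / max_time))

    return min(strategies, key=score)
```

With an even weighting a compressed transfer (cheaper, slower) may win; pushing the weight toward time selects the direct transfer instead.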

Trajectory data are prevalent in systems that monitor the locations of moving objects. In a location-

based service, for instance, the positions of vehicles are continuously monitored through GPS; the

trajectory of each vehicle describes its movement history. We study joins on two sets of trajectories,

generated by two sets M and R of moving objects. For each entity in M, a join returns its k nearest

neighbors from R. We examine how this query can be evaluated in cloud environments. This problem

is not trivial, due to the complexity of the trajectory, and the fact that both the spatial and temporal

dimensions of the data have to be handled. To facilitate this operation, we propose a parallel solution

framework based on MapReduce. We also develop a novel bounding technique, which enables

trajectories to be pruned in parallel. Our approach can be used to parallelize existing single-machine

trajectory join algorithms. To evaluate the efficiency and the scalability of our solutions, we have

performed extensive experiments on a real dataset.

ETPL

BD - 028 Scalable Algorithms for Nearest-Neighbour Joins on Big Trajectory Data
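One common form of the bounding idea mentioned above is a lower bound on the distance between trajectory bounding boxes, which lets whole trajectories be discarded before any exact distance computation. This 2-D spatial sketch omits the temporal dimension that the paper also handles:

```python
def mbr(traj):
    # Minimum bounding rectangle of a 2-D trajectory (list of (x, y) points).
    xs = [p[0] for p in traj]
    ys = [p[1] for p in traj]
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_mindist(a, b):
    # Lower bound on the distance between any point of MBR a and any
    # point of MBR b (0 when the rectangles overlap).
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    dx = max(bx1 - ax2, ax1 - bx2, 0.0)
    dy = max(by1 - ay2, ay1 - by2, 0.0)
    return (dx * dx + dy * dy) ** 0.5

def prune(query_traj, candidates, kth_dist):
    # Keep only trajectories whose lower bound could beat the current
    # k-th nearest distance; the rest are pruned without exact computation.
    q = mbr(query_traj)
    return [t for t in candidates if mbr_mindist(q, mbr(t)) < kth_dist]
```

Because each candidate's bound is independent of the others, this pruning step parallelizes naturally across MapReduce tasks.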

Crowdsourcing is a process of acquisition, integration, and analysis of big and heterogeneous data

generated by a diversity of sources in urban spaces, such as sensors, devices, vehicles, buildings, and

humans. Nowadays, no country, community, or person is immune to urban

emergency events. Detection of urban emergency events, e.g., fires, storms, and traffic jams, is of great

importance for protecting human safety. Recently, social media feeds are rapidly emerging as a

novel platform for the provision and dissemination of information that is often geographic. The content

from social media usually includes references to urban emergency events occurring at, or affecting

specific locations. In this paper, in order to detect and describe real-time urban emergency events,

the 5W (What, Where, When, Who, and Why) model is proposed. Firstly, users of social media are set

as the target of crowdsourcing. Secondly, spatial and temporal information from social media

is extracted to detect the real-time event. Thirdly, a GIS-based annotation of the detected urban

emergency event is shown. The proposed method is evaluated with extensive case studies based on

real urban emergency events. The results show the accuracy and efficiency of the proposed method.

ETPL

BD - 029 Crowdsourcing based Description of Urban Emergency Events using

Social Media Big Data
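The 5W model above is essentially a structured description of a detected event. A minimal sketch, assuming already-structured posts (the field names and the toy aggregation are ours, not the paper's):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class EmergencyEvent5W:
    # One record per detected urban emergency event, in 5W form.
    what: str        # event type, e.g. "fire", "storm", "traffic jam"
    where: tuple     # (latitude, longitude) extracted from posts
    when: datetime   # earliest post time, taken as detection time
    who: list        # contributing social-media users (the "crowd")
    why: str         # inferred cause, free text

def detect_event(posts):
    # Toy detector mirroring the paper's three steps: (1) posters act as
    # the crowd, (2) spatial/temporal fields are extracted, (3) the result
    # is ready for GIS-based annotation. Here posts are already parsed.
    first = posts[0]
    return EmergencyEvent5W(
        what=first["topic"],
        where=first["geo"],
        when=min(p["time"] for p in posts),
        who=[p["user"] for p in posts],
        why=first.get("cause", "unknown"),
    )
```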

As new data and updates are constantly arriving, the results of data mining applications become stale

and obsolete over time. Incremental processing is a promising approach to refresh mining results. It

utilizes previously saved states to avoid the expense of re-computation from scratch. In this paper, we

propose i2MapReduce, a novel incremental processing extension to MapReduce. Compared with the

state-of-the-art work on Incoop, i2MapReduce (i) performs key-value pair level incremental processing

rather than task level re-computation, (ii) supports not only one-step computation but also more

sophisticated iterative computation, and (iii) incorporates a set of novel techniques to reduce I/O

overhead for accessing preserved fine-grain computation states. Experimental results on Amazon EC2

show significant performance improvements of i2MapReduce compared to both plain and iterative

MapReduce performing re-computation.

ETPL

BD - 030 i2MapReduce: Incremental map reduce for mining evolving big data
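The key-value pair level incremental processing idea can be shown in miniature with a word count whose per-key state is preserved: a delta of added or removed records touches only the affected keys instead of triggering re-computation from scratch. This is a simplification of i2MapReduce's actual mechanism, which also handles iterative computations:

```python
def initial_count(records):
    # Full MapReduce-style word count; the per-key state is kept around
    # so that later deltas can be applied at key-value pair granularity.
    state = {}
    for rec in records:
        for word in rec.split():
            state[word] = state.get(word, 0) + 1
    return state

def incremental_update(state, added=(), removed=()):
    # Apply a delta: only keys appearing in the added/removed records are
    # touched, instead of re-running the job over the whole dataset.
    for rec in added:
        for word in rec.split():
            state[word] = state.get(word, 0) + 1
    for rec in removed:
        for word in rec.split():
            state[word] -= 1
            if state[word] == 0:
                del state[word]
    return state
```

The saving comes from the delta being small relative to the dataset: the cost of an update is proportional to the delta, not to the full input.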

Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy

preserving. Therefore it is essential to develop efficient and reliable ciphertext search techniques. One

challenge is that the relationship between documents is normally concealed in the process of

encryption, which leads to a significant degradation of search accuracy. Also, the volume

of data in data centers has experienced a dramatic growth. This will make it even more challenging to

design ciphertext search schemes that can provide efficient and reliable online information retrieval on

large volume of encrypted data. In this paper, a hierarchical clustering method is proposed to support

more search semantics and also to meet the demand for fast ciphertext search within a big data

environment. The proposed hierarchical approach clusters the documents based on the minimum

relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on

the maximum cluster size is reached. In the search phase, this approach achieves linear

computational complexity against an exponential increase in the size of the document collection. In order to

verify the authenticity of search results, a structure called minimum hash sub-tree is designed in this

paper. Experiments have been conducted using the collection set built from the IEEE Xplore. The

results show that with a sharp increase of documents in the dataset the search time of the proposed

method increases linearly whereas the search time of the traditional method increases exponentially.

Furthermore, the proposed method has an advantage over the traditional method in the rank privacy

and relevance of retrieved documents.

ETPL

BD - 032 An Efficient Privacy-Preserving Ranked Keyword Search Method
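The clustering constraints described above (a minimum relevance threshold for membership, then splitting any cluster that exceeds a maximum size) can be sketched as follows; the greedy grouping and midpoint split are our simplifications of the paper's hierarchical method:

```python
def hierarchical_partition(docs, relevance, threshold, max_size):
    # `relevance(a, b)` is an application-supplied similarity score.
    # Phase 1: greedy clustering by the minimum relevance threshold -- a
    # document joins the first cluster where it is relevant to all members.
    clusters = []
    for doc in docs:
        for cluster in clusters:
            if all(relevance(doc, other) >= threshold for other in cluster):
                cluster.append(doc)
                break
        else:
            clusters.append([doc])
    # Phase 2: split any oversized cluster into sub-clusters until the
    # maximum-size constraint holds, yielding the hierarchy's leaves.
    result = []
    stack = clusters
    while stack:
        c = stack.pop()
        if len(c) <= max_size:
            result.append(c)
        else:
            mid = len(c) // 2
            stack.extend([c[:mid], c[mid:]])
    return result
```

Bounding the leaf size is what lets the search phase visit a number of clusters that grows linearly rather than with the full collection size.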