TOP updated Latest List of Hadoop Big Data projects 2016 - 2017 | Hadoop big data latest final Year...
-
Upload
elysium-technologies-private-ltd -
Category
Education
-
view
73 -
download
0
Transcript of TOP updated Latest List of Hadoop Big Data projects 2016 - 2017 | Hadoop big data latest final Year...
Cloud Computing leverages Hadoop framework for processing BigData in parallel. Hadoop has certain
limitations that could be exploited to execute the job efficiently. These limitations are mostly because
of data locality in the cluster, jobs and tasks scheduling, and resource allocations in Hadoop. Efficient
resource allocation remains a challenge in Cloud Computing MapReduce platforms. We propose
H2Hadoop, which is an enhanced Hadoop architecture that reduces the computation cost associated
with BigData analysis. The proposed architecture also addresses the issue of resource allocation in
native Hadoop. H2Hadoop provides a better solution for “text data”, such as finding DNA sequence
and the motif of a DNA sequence. Also, H2Hadoop provides an efficient Data Mining approach for
Cloud Computing environments. H2Hadoop architecture leverages on NameNode’s ability to assign
jobs to the TaskTrakers (DataNodes) within the cluster. By adding control features to the NameNode,
H2Hadoop can intelligently direct and assign tasks to the DataNodes that contain the required data
without sending the job to the whole cluster. Comparing with native Hadoop, H2Hadoop reduces CPU
time, number of read operations, and another Hadoop factors.
ETPL
BD -001 H2Hadoop: Improving Hadoop Performance using the Metadata of
Related Jobs
Hadoop has two components namely HDFS and MapReduce. Hadoop stores user data based on space
utilization of datanodes on the cluster rather than the processing capability of the datanodes.
Furthermore Hadoop runs in a heterogeneous environment as all datanodes may not be homogeneous.
For these reasons, workload imbalances will occur when jobs run in a Hadoop cluster resulting in poor
performance. In this paper, we propose a dynamic algorithm to balance the workload between different
racks on a Hadoop cluster based on the log files of Hadoop. However, if the tasks are executing on
critical or sensitive data in a secured rack, the data transfer to an unsecured cluster will result in privacy
being compromised. We propose an approach to transfer data between racks without disclosing private
information. Moving tasks from the most overloaded rack to another rack improves the performance
of MapReduce jobs. Our simulations indicate that the proposed algorithm decreases running time of a
job by more than 50% running on the most overload rack.
ETPL
BD - 002 Privacy Preserving Rack-Based Dynamic Workload Balancing for
Hadoop Map Reduce
Big data analysis has gained significant popularity within the last few years. The MapReduce
framework presented by Google makes it easier to write applications that process vast amount of data.
MapReduce targets at large commodity clusters where failures are not exceptions. However, Hadoop,
the most popular implementation of MapReduce performs poorly under failures. Hadoop implements
the fault tolerance strategy at the task level, as a result, a task failure will require a re-execution of the
whole task regardless of how much input has already been processed. In this paper, we present BeTL
which introduces slight changes to the execution flow of MapReduce, and makes it possible to gain a
finer-grained fault tolerance. Map tasks can create checkpoints so that a retrying task doesn't have to
start from scratch and thus saves much time. Speculation strategy can also benefit from this. The new
execution flow involves less IO operations and performs better than Hadoop even under no failures. In
our experiments, BeTL outperforms Hadoop by 6.6 percent on average under no failures and 4.6 to
51.0 percent under different failure densities.
ETPL
BD -003 BeTL: Map Reduce Checkpoint Tactics Beneath the Task Level
Hadoop is a standard implementation of MapReduce framework for running data-intensive
applications on the clusters of commodity servers. By thoroughly studying the framework we find out
that the shuffle phase, all-to-all input data fetching phase in reduce task significantly affect the
application performance. There is a problem of variance in both the intermediate key's frequencies and
their distribution among data nodes throughout the cluster in Hadoop's MapReduce system. This
variance in system causes network overhead which leads to unfairness on the reduce input among
different data nodes in the cluster. Because of the above problem, applications experience performance
degradation due to shuffle phase of MapReduce applications. We develop a new novel algorithm;
unlike previous systems our algorithm considers a node's capabilities as heuristics to decide a better
available trade-off for the locality and fairness in the system. By comparing with the default Hadoop's
partitioning algorithm and Leen algorithm, on the average our approach achieve performance gain of
29% and 17%, respectively.
ETPL
BD - 004 An efficient key partitioning scheme for heterogeneous Map Reduce
clusters
It is cost-efficient for a tenant with a limited budget to establish a virtual MapReduce cluster by renting
multiple virtual private servers (VPSs) from a VPS provider. To provide an appropriate scheduling
scheme for this type of computing environment, we propose in this paper a hybrid job-driven
scheduling scheme (JoSS for short) from a tenant's perspective. JoSS provides not only job-level
scheduling, but also map-task level scheduling and reduce-task level scheduling. JoSS classifies
MapReduce jobs based on job scale and job type and designs an appropriate scheduling policy to
schedule each class of jobs. The goal is to improve data locality for both map tasks and reduce tasks,
avoid job starvation, and improve job execution performance. Two variations of JoSS are further
introduced to separately achieve a better map-data locality and a faster task assignment. We conduct
extensive experiments to evaluate and compare the two variations with current scheduling algorithms
supported by Hadoop. The results show that the two variations outperform the other tested algorithms
in terms of map-data locality, reduce-data locality, and network overhead without incurring significant
overhead. In addition, the two variations are separately suitable for different MapReduce-workload
scenarios and provide the best job performance among all tested algorithms.
ETPL
BD -005 Hybrid Job-Driven Scheduling for Virtual Map Reduce Clusters
We present concepts which can be used for the efficient implementation of Attribute Based Access
Control (ABAC) in large applications using maybe several data storage technologies, including
Hadoop, NoSQL and relational database systems. The ABAC authorization process takes place in two
main stages. Firstly a sequence of permissions is derived which specifies permitted data to be retrieved
for the user's transaction. Secondly, query modification is used to augment the user's transaction with
code which implements the ABAC controls. This requires the storage technologies to support a high-
level language such as SQL or similar. The modified user transactions are then optimized and processed
using the full functionality of the underlying storage systems. We use an extended ABAC model
(TCM2) which handles negative permissions and overrides in a single permissions processing
mechanism. We illustrate these concepts using a compelling electronic health records scenario.
ETPL
BD - 006 Attribute Based Access Control for Big Data Applications by Query
Modification
DSePHR is proposed in this work in order to manage the encrypted PHR data on a cloud storage.
HBase and Hadoop are utilized in this work. The objective is to provide an API for any PHR system
to upload/download the encrypted PHR data from a cloud storage. The DSePHR resolves the
Namenode memory issues of HDFS when storing a lot of small files by classifying the encrypted PHR
data into small and large files. The small files will be handled by HBase schema that is proposed in
this work. The memory consumption and the processing time of the proposed DSePHR are evaluated
using real data sets collected from various healthcare communities.
ETPL
BD -007 Distributed storage design for encrypted personal health record data
Existing parallel mining algorithms for frequent itemsets lack a mechanism that enables automatic
parallelization, load balancing, data distribution, and fault tolerance on large clusters. As a solution to
this problem, we design a parallel frequent itemsets mining algorithm called FiDoop using the
MapReduce programming model. To achieve compressed storage and avoid building conditional
pattern bases, FiDoop incorporates the frequent items ultrametric tree, rather than conventional FP
trees. In FiDoop, three MapReduce jobs are implemented to complete the mining task. In the crucial
third MapReduce job, the mappers independently decompose itemsets, the reducers perform
combination operations by constructing small ultrametric trees, and the actual mining of these trees
separately. We implement FiDoop on our in-house Hadoop cluster. We show that FiDoop on the cluster
is sensitive to data distribution and dimensions, because itemsets with different lengths have different
decomposition and construction costs. To improve FiDoop's performance, we develop a workload
balance metric to measure load balance across the cluster's computing nodes. We develop FiDoop-HD,
an extension of FiDoop, to speed up the mining performance for high-dimensional data analysis.
Extensive experiments using real-world celestial spectral data demonstrate that our proposed solution
is efficient and scalable.
ETPL
BD - 008 FiDoop: Parallel Mining of Frequent Item sets Using Map Reduce
In big-data-driven traffic flow prediction systems, the robustness of prediction performance depends
on accuracy and timeliness. This paper presents a new MapReduce-based nearest neighbor (NN)
approach for traffic flow prediction using correlation analysis (TFPC) on a Hadoop platform. In
particular, we develop a real-time prediction system including two key modules, i.e., offline distributed
training (ODT) and online parallel prediction (OPP). Moreover, we build a parallel -nearest neighbor
optimization classifier, which incorporates correlation information among traffic flows into the
classification process. Finally, we propose a novel prediction calculation method, combining the
current data observed in OPP and the classification results obtained from large-scale historical data in
ODT, to generate traffic flow prediction in real time. The empirical study on real-world traffic flow
big data using the leave-one-out cross validation method shows that TFPC significantly outperforms
four state-of-the-art prediction approaches, i.e., autoregressive integrated moving average, Naïve
Bayes, multilayer perceptron neural networks, and NN regression, in terms of accuracy, which can be
improved 90.07% in the best case, with an average mean absolute percent error of 5.53%. In addition,
it displays excellent speedup, scaleup, and sizeup.
ETPL
BD - 009 A Map Reduce-Based Nearest Neighbour Approach for Big-Data-Driven
Traffic Flow Prediction
As the adoption of Cloud Computing is growing exponentially, a huge sheer amount of data is
generated therefore needing to be processed in order to control efficiently what is going within the
infrastructure, and also to respond effectively and promptly to security threats. Herein, we provide a
highly scalable plugin based and comprehensive solution in order to have a real-time monitoring by
reducing the impact of an attack or a particular issue in the overall distributed infrastructure. This work
covers a bigger scope in infrastructure security by monitoring all devices that generate log files or
generate network traffic. By applying different Big Data techniques for data analysis, we can ensure a
responsive solution to any problem (security or other) within the infrastructure and acting accordingly.
ETPL
BD - 010 Toward a cloud-based security intelligence with big data processing
In this paper, we study the problems of scalability and performance for similarity search by proposing
eHSim, an efficient hybrid similarity search with MapReduce. More specifically, we introduce
clustering schemes that partition objects into different groups by their length. Additionally, we equip
our proposed schemes with pruning strategies that quickly discard irrelevant objects before truly
computing their similarity. Moreover, we design a hybrid MapReduce architecture that deals with
challenges from big data. Furthermore, we implement our proposed methods with MapReduce and
make them compatible with the hybrid MapReduce architecture. Last but not least, we evaluate the
proposed methods with real datasets. Empirical experiments show that our approach is considerably
more efficient than state-of-the-arts in terms of query processing, batch processing, and data storage.
ETPL
BD - 011 eHSim: An Efficient Hybrid Similarity Search with Map Reduce
Traffic for a typical MapReduce job in a data center consists of multiple network flows. Traditionally,
network resources have been allocated to optimize network-level metrics such as flow completion time
or throughput. Some recent schemes propose using application-aware scheduling which can shorten
the average job completion time. However, most of them treat the core network as a black box with
sufficient capacity. Even if only one network link in the core network becomes a bottleneck, it can hurt
application performance. We design and implement a centralized flow-scheduling framework called
Phurti with the goal of improving the completion time for jobs in a cluster shared among multiple
Hadoop jobs (multi-tenant). Phurti communicates both with the Hadoop framework to retrieve job-
level network traffic information and the OpenFlow-based switches to learn about the network
topology. Phurti implements a novel heuristic called Smallest Maximum Sequential-traffic First
(SMSF) that uses collected application and network information to perform traffic scheduling for
MapReduce jobs. Our evaluation with real Hadoop workloads shows that compared to application and
network-agnostic scheduling strategies, Phurti improves job completion time for 95% of the jobs,
decreases average job completion time by 20%, tail job completion time by 13% and scales well with
the cluster size and number of jobs.
ETPL
BD - 012 Phurti: Application and Network-Aware Flow Scheduling for Multi-
tenant Map Reduce Clusters
Erasure code based distributed storage systems are increasingly being used by storage providers for
big data storage since they offer the same reliability as replication with significant decrease in the
amount of storage required. But, when it comes to a storage system with data nodes spread across a
very large geographical area, the code's recovery performance is affected by various factors, both
network and computation related. In this paper, we expose an XOR based code supplemented with the
ideas of parity duplication and rack awareness that could be adopted in such storage clusters to improve
the recovery performance during node failures. We have implemented them on the erasure code module
of the XORBAS version of the Hadoop Distributed File System (HDFS). For evaluating the
performance of the proposed ideas, we employ a geo-diverse cluster on the NeCTAR research cloud.
The experimental results show that the techniques aid in bringing down the data read for repair by a
factor of 85% and repair duration by a factor of 57% during node failures, though resulting in an
increased storage requirement of 21% as compared to the traditional Reed-Solomon codes used in
HDFS. The sum of all these ideas could offer a better solution for a code based storage system spanning
a wide geographical area that has storage constraints such that a triple replicated system is not
affordable and at the same time has strict requirements on ensuring the minimal recovery time.
ETPL
BD - 013 On improving recovery performance in erasure code based geo-diverse
storage clusters
Users of cloud storage usually assign different redundancy configurations (i.e.,) of erasure codes,
depending on the desired balance between performance and fault tolerance. Our study finds that with
very low probability, one coding scheme chosen by rules of thumb, for a given redundancy
configuration, performs best. In this paper, we propose CaCo, an efficient Cauchy coding approach for
data storage in the cloud. First, CaCo uses Cauchy matrix heuristics to produce a matrix set. Second,
for each matrix in this set, CaCo uses XOR schedule heuristics to generate a series of schedules.
Finally, CaCo selects the shortest one from all the produced schedules. In such a way, CaCo has the
ability to identify an optimal coding scheme, within the capability of the current state of the art, for an
arbitrary given redundancy configuration. By leverage of CaCo's nature of ease to parallelize, we boost
significantly the performance of the selection process with abundant computational resources in the
cloud. We implement CaCo in the Hadoop distributed file system and evaluate its performance by
comparing with “Hadoop-EC” developed by Microsoft research. Our experimental results indicate that
CaCo can obtain an optimal coding scheme within acceptable time. Furthermore, CaCo outperforms
Hadoop-EC by 26.68-40.18 percent in the encoding time and by 38.4-52.83 percent in the decoding
time simultaneously.
ETPL
BD - 014 CaCo: An Efficient Cauchy Coding Approach for Cloud Storage
Systems
Infrastructure-as-a-service clouds are becoming widely adopted. However, resource sharing and multi-
tenancy have made performance anomalies a top concern for users. Timely debugging those anomalies
is paramount for minimizing the performance penalty for users. Unfortunately, this debugging often
takes a long time due to the inherent complexity and sharing nature of cloud infrastructures. When an
application experiences a performance anomaly, it is important to distinguish between faults with a
global impact and faults with a local impact as the diagnosis and recovery steps forfaults with a global
impact or local impact are quite different. In this paper, we present PerfCompass, an online
performance anomaly fault debugging tool that can quantify whether a production-run performance
anomaly has a global impact or local impact. PerfCompass can use this information to suggest the root
cause as either an external fault (e.g., environment-based) or an internal fault (e.g., software bugs).
Furthermore, PerfCompass can identify top affected system calls to provide useful diagnostic hints for
detailed performance debugging. PerfCompass does not require source code or runtime application
instrumentation, which makes it practical for production systems. We have tested PerfCompass by
running five common open source systems (e.g., Apache, MySQL, Tomcat, Hadoop, Cassandra) inside
a virtualized cloud testbed. Our experiments use a range of common infrastructure sharing issues and
real software bugs. The results show that PerfCompass accurately classifies 23 out of the 24 tested
cases without calibration and achieves 100 percent accuracy with calibration. PerfCompass provides
useful diagnosis hints within several minutes and imposes negligible runtime overhead to the
production system during normal execution time.
ETPL
BD - 015 PerfCompass: Online Performance Anomaly Fault Localization and
Inference in Infrastructure-as-a-Service Clouds
In recent years, big data have become a hot research topic. The increasing amount of big data also
increases the chance of breaching the privacy of individuals. Since big data require high computational
power and large storage, distributed systems are used. As multiple parties are involved in these systems,
the risk of privacy violation is increased. There have been a number of privacy-preserving mechanisms
developed for privacy protection at different stages (e.g., data generation, data storage, and data
processing) of a big data life cycle. The goal of this paper is to provide a comprehensive overview of
the privacy preservation mechanisms in big data and present the challenges for existing mechanisms.
In particular, in this paper, we illustrate the infrastructure of big data and the state-of-the-art privacy-
preserving mechanisms in each stage of the big data life cycle. Furthermore, we discuss the challenges
and future research directions related to privacy preservation in big data.
ETPL
BD - 016 Protection of Big Data Privacy
With the advancement of information and communication technology, data are being generated at an
exponential rate via various instruments and collected at an unprecedented scale. Such large volume
of data generated is referred to as big data, which now are revolutionizing all aspects of our life ranging
from enterprises to individuals, from science communities to governments, as they exhibit great
potentials to improve efficiency of enterprises and the quality of life. To obtain nontrivial patterns and
derive valuable information from big data, a fundamental problem is how to properly place the
collected data by different users to distributed clouds and to efficiently analyze the collected data to
save user costs in data storage and processing, particularly the cost savings of users who share data.
By doing so, it needs the close collaborations among the users, by sharing and utilizing the big data in
distributed clouds due to the complexity and volume of big data. Since computing, storage and
bandwidth resources in a distributed cloud usually are limited, and such resource provisioning typically
is expensive, the collaborative users require to make use of the resources fairly. In this paper, we study
a novel collaboration- and fairness-aware big data management problem in distributed cloud
environments that aims to maximize the system throughout, while minimizing the operational cost of
service providers to achieve the system throughput, subject to resource capacity and user fairness
constraints. We first propose a novel optimization framework for the problem. We then devise a fast
yet scalable approximation algorithm based on the built optimization framework. We also analyze the
time complexity and approximation ratio of the proposed algorithm. We finally conduct experiments
by simulations to evaluate the performance of the proposed algorithm. Experimental results
demonstrate that the proposed algorithm is promising, and outperforms other heuristics.
ETPL
BD - 017 Collaboration- and Fairness-Aware Big Data Management in Distributed
Clouds
Big data are widely recognized as being one of the most powerful drivers to promote productivity,
improve efficiency, and support innovation. It is highly expected to explore the power of big data and
turn big data into big values. To answer the interesting question whether there are inherent correlations
between the two tendencies of big data and green challenges, a recent study has investigated the issues
on greening the whole life cycle of big data systems. This paper would like to discover the relations
between the trend of big data era and that of the new generation green revolution through a
comprehensive and panoramic literature survey in big data technologies toward various green
objectives and a discussion on relevant challenges and future directions.
ETPL
BD - 018 Big Data Meet Green Challenges: Big Data toward Green Applications
To improve the efficiency of big data feature learning, the paper proposes a privacy preserving deep
computation model by offloading the expensive operations to the cloud. Privacy concerns become
evident because there are a large number of private data by various applications in the smart city, such
as sensitive data of governments or proprietary information of enterprises. To protect the private data,
the proposed model uses the BGV encryption scheme to encrypt the private data and employs cloud
servers to perform the high-order back-propagation algorithm on the encrypted data efficiently for deep
computation model training. Furthermore, the proposed scheme approximates the sigmoid function as
a polynomial function to support the secure computation of the activation function with the BGV
encryption. In our scheme, only the encryption operations and the decryption operations are performed
by the client while all the computation tasks are performed on the cloud. Experimental results show
that our scheme is improved by approximately 2.5 times in the training efficiency compared to the
conventional deep computation model without disclosing the private data using the cloud computing
including ten nodes. More importantly, our scheme is highly scalable by employing more cloud servers,
which is particularly suitable for big data.
ETPL
BD - 019 Privacy Preserving Deep Computation Model on Cloud for Big Data
Feature Learning
Big Data is a term that describes the large volume of data both structured and unstructured that is
difficult to process using traditional database and software techniques. Cloud computing is a
technology that offers a solution to this problem. We have designed a cloud-based framework for
unstructured data analysis that is motivated by the goal of efficient image data analysis. The framework
consists of two general stages of feature extraction and machine learning that can be used in training
mode and classification mode. The framework uses sampling and feedback mechanisms whereby the
system learns which image features are most important, and also learns which algorithm(s) is best
(under user criteria of accuracy and speed). This information allows the system to be more efficient by
automatically reducing the number of features captured, and number of algorithms being used to
evaluate the data. While there is an overhead to the auto-adjusting mechanisms, they do lead to a more
efficient solution for big data sets. The solution is evaluated using a leaf images classification problem,
and the results shows the improvements in efficiency.
ETPL
BD - 020 Efficient Cloud-Based Framework for Big Data Classification
The MapReduce programming model simplifies large-scale data processing on commodity cluster by
exploiting parallel map tasks and reduce tasks. Although many efforts have been made to improve the
performance of MapReduce jobs, they ignore the network traffic generated in the shuffle phase, which
plays a critical role in performance enhancement. Traditionally, a hash function is used to partition
intermediate data among reduce tasks, which, however, is not traffic-efficient because network
topology and data size associated with each key are not taken into consideration. In this paper, we study
to reduce network traffic cost for a MapReduce job by designing a novel intermediate data partition
scheme. Furthermore, we jointly consider the aggregator placement problem, where each aggregator
can reduce merged traffic from multiple map tasks. A decomposition-based distributed algorithm is
proposed to deal with the large-scale optimization problem for big data application and an online
algorithm is also designed to adjust data partition and aggregation in a dynamic manner. Finally,
extensive simulation results demonstrate that our proposals can significantly reduce network traffic
cost under both offline and online cases.
ETPL
BD - 022 On Traffic-Aware Partition and Aggregation in Map Reduce for Big
Data Applications
Big Data distribution has benefited from the Cloud resources to accommodate application's QoS
requirements. In this paper, we propose Big Data distribution scheme that matches the Cloud available
resources to guarantee application's QoS given the continuously dynamic and varying resources of the
Cloud infrastructure. We developed Two-Level QoS Policies (TLPS) for selecting clusters and nodes
while satisfying the client's application QoS. We also proposed an adaptive data distribution algorithm
to cope with changing QoS requirements. Experiments have been conducted to evaluate both the
effectiveness and the communication overhead of our proposed distribution scheme and the results we
have reported are convincing. Other experiments evaluated our TLPS algorithm against other single-
based QoS data distribution algorithms and the results show that TLPS algorithm adapts to the
customer QoS requirements.
ETPL
BD - 021 Policy-Based QoS Enforcement for Adaptive Big Data Distribution on
the Cloud
How to control the access of the huge amount of big data becomes a very challenging issue, especially
when big data are stored in the cloud. Ciphertext-Policy Attributebased Encryption (CP-ABE) is a
promising encryption technique that enables end-users to encrypt their data under the access policies
defined over some attributes of data consumers and only allows data consumers whose attributes satisfy
the access policies to decrypt the data. In CP-ABE, the access policy is attached to the ciphertext in
plaintext form, which may also leak some private information about end-users. Existing methods only
partially hide the attribute values in the access policies, while the attribute names are still unprotected.
In this paper, we propose an efficient and fine-grained big data access control scheme with privacy-
preserving policy. Specifically, we hide the whole attribute (rather than only its values) in the access
policies. To assist data decryption, we also design a novel Attribute Bloom Filter to evaluate whether
an attribute is in the access policy and locate the exact position in the access policy if it is in the access
policy. Security analysis and performance evaluation show that our scheme can preserve the privacy
from any LSSS access policy without employing much overhead.
ETPL
BD - 023 An Efficient and Fine-grained Big Data Access Control Scheme with
Privacy-preserving Policy
Big-data computing is a new critical challenge for the ICT industry. Engineers and researchers are
dealing with data sets of petabyte scale in the cloud computing paradigm. Thus, the demand for
building a service stack to distribute, manage, and process massive data sets has risen drastically. In
this paper, we investigate the Big Data Broadcasting problem for a single source node to broadcast a
big chunk of data to a set of nodes with the objective of minimizing the maximum completion time.
These nodes may locate in the same datacenter or across geo-distributed datacenters. This problem is
one of the fundamental problems in distributed computing and is known to be NP-hard in
heterogeneous environments. We model the Big-data broadcasting problem into a LockStep Broadcast
Tree (LSBT) problem. The main idea of the LSBT model is to define a basic unit of upload bandwidth,
r, such that a node with capacity c broadcasts data to a set of [c/r] children at the rater. Note that r is a
parameter to be optimized as part of the LSBT problem. We further divide the broadcast data into m
chunks. These data chunks can then be broadcast down the LSBT in a pipeline manner. In a
homogeneous network environment in which each node has the same upload capacity c, we show that
the optimal uplink rate r* of LSBT is either c/2 or c/3, whichever gives the smaller maximum
completion time. For heterogeneous environments, we present an O(nlog2n) algorithm to select an
optimal uplink rater* and to construct an optimal LSBT. Numerical results show that our approach
performs well with less maximum completion time and lower computational complexity than other
efficient solutions in literature.
ETPL
BD - 024 A Novel Pipeline Approach for Efficient Big Data Broadcasting
Given a point p and a set of points S, the kNN operation finds the k closest points to p in S. It is a
computational intensive task with a large range of applications such as knowledge discovery or data
mining. However, as the volume and the dimension of data increase, only distributed approaches can
perform such costly operation in a reasonable time. Recent works have focused on implementing
efficient solutions using the MapReduce programming model because it is suitable for distributed large
scale data processing. Although these works provide different solutions to the same problem, each one
has particular constraints and properties. In this paper, we compare the different existing approaches
for computing kNN on MapReduce, first theoretically, and then by performing an extensive
experimental evaluation. To be able to compare solutions, we identify three generic steps for kNN
computation on MapReduce: data pre-processing, data partitioning and computation. We then analyze
each step from load balancing, accuracy and complexity aspects. Experiments in this paper use a variety
of datasets, and analyze the impact of data volume, data dimension and the value of k from many
perspectives like time and space complexity, and accuracy. The experimental part brings new
advantages and shortcomings that are discussed for each algorithm. To the best of our knowledge, this
is the first paper that compares kNN computing methods on MapReduce both theoretically and
experimentally with the same setting. Overall, this paper can be used as a guide to tackle kNN-based
practical problems in the context of big data.
ETPL
BD - 025 K Nearest Neighbour Joins for Big Data on Map Reduce: a Theoretical
and Experimental Analysis
This paper introduces HadoopViz; a MapReduce-based framework for visualizing big spatial data.
HadoopViz has three unique features that distinguish it from other techniques. (1) It exposes an
extensible interface which allows users to define a new visualization types, e.g., scatter plot, road
network, or heat map, by defining five abstract functions, without delving into the implementation
details of the MapReduce algorithms. As it is open source, HadoopViz allows algorithm designers to
focus on how the data should be visualized rather than performance or scalability issues. (2) HadoopViz
is capable of generating big images with giga-pixel resolution by employing a three-phase technique,
partition-plot-merge. (3) HadoopViz provides a smoothing functionality which can fuse nearby records
together as the image is plotted. This makes it capable of generating more types of images with high
quality as compared to existing work. Experimental results on real datasets of up to 14 Billion points
show the extensibility, scalability, and efficiency of HadoopViz to handle different visualization types
of spatial big data.
ETPL
BD - 026 HadoopViz: A Map Reduce framework for extensible visualization of
big spatial data
The global deployment of cloud datacenters is enabling large scale scientific workflows to improve
performance and deliver fast responses. This unprecedented geographical distribution of the
computation is doubled by an increase in the scale of the data handled by such applications, bringing
new challenges related to the efficient data management across sites. High throughput, low latencies
or cost-related trade-offs are just a few concerns for both cloud providers and users when it comes to
handling data across datacenters. Existing solutions are limited to cloud-provided storage, which offers
low performance based on rigid cost schemes. In turn, workflow engines need to improvise substitutes,
achieving performance at the cost of complex system configurations, maintenance overheads, reduced
reliability and reusability. In this paper, we introduce OverFlow, a uniform data management system
for scientific workflows running across geographically distributed sites, aiming to reap economic
benefits from this geo-diversity. Our solution is environment-aware, as it monitors and models the
global cloud infrastructure, offering high and predictable data handling performance for transfer cost
and time, within and across sites. OverFlow proposes a set of pluggable services, grouped in a data
scientist cloud kit. They provide the applications with the possibility to monitor the underlying
infrastructure, to exploit smart data compression, deduplication and geo-replication, to evaluate data
management costs, to set a tradeoff between money and time, and optimize the transfer strategy
accordingly. The system was validated on the Microsoft Azure cloud across its 6 EU and US
datacenters. The experiments were conducted on hundreds of nodes using synthetic benchmarks and
real-life bio-informatics applications (A-Brain, BLAST). The results show that our system is able to
model accurately the cloud performance and to leverage this for efficient data dissemination, being
able to reduc- the monetary costs and transfer time by up to three times.
ETPL
BD - 027 OverFlow: Multi-Site Aware Big Data Management for Scientific
Workflows on Clouds
Trajectory data are prevalent in systems that monitor the locations of moving objects. In a location-
based service, for instance, the positions of vehicles are continuously monitored through GPS; the
trajectory of each vehicle describes its movement history. We study joins on two sets of trajectories,
generated by two sets M and R of moving objects. For each entity in M, a join returns its k nearest
neighbors from R. We examine how this query can be evaluated in cloud environments. This problem
is not trivial, due to the complexity of the trajectory, and the fact that both the spatial and temporal
dimensions of the data have to be handled. To facilitate this operation, we propose a parallel solution
framework based on MapReduce. We also develop a novel bounding technique, which enables
trajectories to be pruned in parallel. Our approach can be used to parallelize existing single-machine
trajectory join algorithms. To evaluate the efficiency and the scalability of our solutions, we have
performed extensive experiments on a real dataset.
ETPL
BD - 028 Scalable Algorithms for Nearest-Neighbour Joins on Big Trajectory Data
Crowdsourcing is a process of acquisition, integration, and analysis of big and heterogeneous data
generated by a diversity of sources in urban spaces, such as sensors, devices, vehicles, buildings, and
human. Especially, nowadays, no countries, no communities, and no person are immune to urban
emergency events. Detection about urban emergency events, e.g., fires, storms, traffic jams is of great
importance to protect the security of humans. Recently, social media feeds are rapidly emerging as a
novel platform for providing and dissemination of information that is often geographic. The content
from social media usually includes references to urban emergency events occurring at, or affecting
specific locations. In this paper, in order to detect and describe the real time urban emergency event,
the 5W (What, Where, When, Who, and Why) model is proposed. Firstly, users of social media are set
as the target of crowd sourcing. Secondly, the spatial and temporal information from the social media
are extracted to detect the real time event. Thirdly, a GIS based annotation of the detected urban
emergency event is shown. The proposed method is evaluated with extensive case studies based on
real urban emergency events. The results show the accuracy and efficiency of the proposed method.
ETPL
BD - 029 Crowdsourcing based Description of Urban Emergency Events using
Social Media Big Data
As new data and updates are constantly arriving, the results of data mining applications become stale
and obsolete over time. Incremental processing is a promising approach to refresh mining results. It
utilizes previously saved states to avoid the expense of re-computation from scratch. In this paper, we
propose i2MapReduce, a novel incremental processing extension to MapReduce. Compared with the
state-of-the-art work on Incoop, i2MapReduce (i) performs key-value pair level incremental processing
rather than task level re-computation, (ii) supports not only one-step computation but also more
sophisticated iterative computation, and (iii) incorporates a set of novel techniques to reduce I/O
overhead for accessing preserved fine-grain computation states. Experimental results on Amazon EC2
show significant performance improvements of i2MapReduce compared to both plain and iterative
MapReduce performing re-computation.
ETPL
BD - 030 i2MapReduce: Incremental map reduce for mining evolving big data
Trajectory data are prevalent in systems that monitor the locations of moving objects. In a location-
based service, for instance, the positions of vehicles are continuously monitored through GPS; the
trajectory of each vehicle describes its movement history. We study joins on two sets of trajectories,
generated by two sets M and R of moving objects. For each entity in M, a join returns its k nearest
neighbors from R. We examine how this query can be evaluated in cloud environments. This problem
is not trivial, due to the complexity of the trajectory, and the fact that both the spatial and temporal
dimensions of the data have to be handled. To facilitate this operation, we propose a parallel solution
framework based on MapReduce. We also develop a novel bounding technique, which enables
trajectories to be pruned in parallel. Our approach can be used to parallelize existing single-machine
trajectory join algorithms. To evaluate the efficiency and the scalability of our solutions, we have
performed extensive experiments on a real dataset.
ETPL
BD - 031 Scalable algorithms for nearest-neighbour joins on big trajectory data
Cloud data owners prefer to outsource documents in an encrypted form for the purpose of privacy
preserving. Therefore it is essential to develop efficient and reliable ciphertext search techniques. One
challenge is that the relationship between documents will be normally concealed in the process of
encryption, which will lead to significant search accuracy performance degradation. Also the volume
of data in data centers has experienced a dramatic growth. This will make it even more challenging to
design ciphertext search schemes that can provide efficient and reliable online information retrieval on
large volume of encrypted data. In this paper, a hierarchical clustering method is proposed to support
more search semantics and also to meet the demand for fast ciphertext search within a big data
environment. The proposed hierarchical approach clusters the documents based on the minimum
relevance threshold, and then partitions the resulting clusters into sub-clusters until the constraint on
the maximum size of cluster is reached. In the search phase, this approach can reach a linear
computational complexity against an exponential size increase of document collection. In order to
verify the authenticity of search results, a structure called minimum hash sub-tree is designed in this
paper. Experiments have been conducted using the collection set built from the IEEE Xplore. The
results show that with a sharp increase of documents in the dataset the search time of the proposed
method increases linearly whereas the search time of the traditional method increases exponentially.
Furthermore, the proposed method has an advantage over the traditional method in the rank privacy
and relevance of retrieved documents.
ETPL
BD - 032 An Efficient Privacy-Preserving Ranked Keyword Search Method