Elysium Technologies Private Limited | Singapore | Madurai | Chennai | Trichy | Ramnad | Erode | Tirunelveli | Sivakasi | Dindugul
http://www.elysiumtechnologies.com, [email protected]
Mining high utility itemsets (HUIs) from databases is an important data mining task, which refers to the
discovery of itemsets with high utilities (e.g. high profits). However, it may present too many HUIs to users,
which also degrades the efficiency of the mining process. To achieve high efficiency for the mining task and
provide a concise mining result to users, we propose a novel framework in this paper for mining closed+ high utility itemsets (CHUIs), which serve as a compact and lossless representation of HUIs. We propose three efficient algorithms named AprioriHC (Apriori-based algorithm for mining High utility Closed+ itemsets), AprioriHC-D (AprioriHC algorithm with Discarding unpromising and isolated items) and CHUD (Closed+ High Utility Itemset Discovery) to find this representation. Further, a method called DAHU (Derive All High
Utility Itemsets) is proposed to recover all HUIs from the set of CHUIs without accessing the original
database. Results on real and synthetic datasets show that the proposed algorithms are very efficient and that
our approaches achieve a massive reduction in the number of HUIs. In addition, when all HUIs can be
recovered by DAHU, the combination of CHUD and DAHU outperforms the state-of-the-art algorithms for
mining HUIs.
ETPL
DM - 001
Efficient Algorithms for Mining the Concise and Lossless Representation of High Utility Itemsets
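As a rough illustration of the idea behind this title, the sketch below mines closed high utility itemsets by brute force; it is not the AprioriHC/CHUD algorithms themselves, which rely on heavy pruning, and all names and data are illustrative:

```python
from itertools import combinations

def support(transactions, itemset):
    """Number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if all(i in t for i in itemset))

def utility(transactions, utilities, itemset):
    """Total utility (quantity x unit profit) of the itemset across the
    transactions that contain it."""
    return sum(t[i] * utilities[i]
               for t in transactions if all(i in t for i in itemset)
               for i in itemset)

def mine_chuis(transactions, utilities, min_util):
    """Brute-force closed high utility itemset mining (illustration only).
    An itemset is closed when no one-item extension has the same support."""
    items = sorted(utilities)
    chuis = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            u = utility(transactions, utilities, cand)
            if u < min_util:
                continue
            sup = support(transactions, cand)
            if all(support(transactions, cand + (x,)) < sup
                   for x in items if x not in cand):
                chuis[cand] = u
    return chuis
```

On a toy database of three transactions this already shows the compaction effect: every high utility itemset can be recovered from the closed ones plus their supports, which is the role DAHU plays in the paper.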
In recent years, some authors have approached the instance selection problem from a meta-learning
perspective. In their work, they try to find relationships between the performance of some methods from this
field and the values of some data-complexity measures, with the aim of determining the best performing
method given a data set, using only the values of the measures computed on this data. Nevertheless, most of
the data-complexity measures existing in the literature were not conceived for this purpose and the feasibility
of their use in this field is yet to be determined. In this paper, we revise the definition of some measures that
we presented in a previous work, that were designed for meta-learning based instance selection. Also, we
assess them in an experimental study involving three sets of measures, 59 databases, 16 instance selection
methods, two classifiers, and eight regression learners used as meta-learners. The results suggest that our
measures are more efficient and effective than those traditionally used by researchers who have addressed instance selection from a meta-learning perspective.
ETPL
DM - 002
A Set of Complexity Measures Designed for Applying Meta-Learning to
Instance Selection
Data mining has wide applications in many areas such as banking, medicine, scientific research, and government agencies. Classification is one of the commonly used tasks in data mining applications. For the
past decade, due to the rise of various privacy issues, many theoretical and practical solutions to the
classification problem have been proposed under different security models. However, with the recent
popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted form, as
well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-
preserving classification techniques are not applicable. In this paper, we focus on solving the classification
problem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the
cloud. The proposed protocol protects the confidentiality of data, privacy of user's input query, and hides the
data access patterns. To the best of our knowledge, our work is the first to develop a secure k-NN classifier
over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed
protocol using a real-world dataset under different parameter settings.
ETPL
DM - 003
k-Nearest Neighbor Classification over Semantically Secure Encrypted
Relational Data
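The plaintext core that such a protocol ultimately evaluates is ordinary k-NN classification; a minimal unencrypted sketch of that core is below (the secure version in the paper replaces every distance computation and comparison with cryptographic sub-protocols; names here are illustrative):

```python
import math
from collections import Counter

def knn_classify(train, query, k):
    """train: list of (feature_vector, label) pairs; query: feature vector.
    Returns the majority label among the k nearest training points."""
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]
```

In the secure setting neither the sorting key nor the selected neighbors may be revealed to the cloud, which is what makes hiding data access patterns the hard part.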
As a newly emerging network model, heterogeneous information networks (HINs) have received growing
attention. Many data mining tasks have been explored in HINs, including clustering, classification, and
similarity search. Similarity join is a fundamental operation required for many problems. It is attracting
attention from various applications on network data, such as friend recommendation, link prediction, and
online advertising. Although similarity join has been well studied in homogeneous networks, it has not yet
been studied in heterogeneous networks. In particular, none of the existing research on similarity join takes the different semantic meanings behind paths into consideration, and almost all of it completely ignores the heterogeneity and diversity of HINs. In this paper, we propose a path-based similarity join (PS-join)
method to return the top k similar pairs of objects based on any user specified join path in a heterogeneous
information network. We study how to prune expensive similarity computation by introducing bucket pruning
based locality sensitive hashing (BPLSH) indexing. Compared with the existing Link-based Similarity join (LS-join) method, PS-join can derive various similarity semantics. Experimental results on real data sets show the efficiency and effectiveness of the proposed approach.
ETPL
DM - 004
Top-k Similarity Join in Heterogeneous Information Networks
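The pruning idea can be sketched with generic random-hyperplane LSH (the paper's BPLSH is path-based and more elaborate; this toy version only shows why bucketing avoids most exact similarity computations, and all names are illustrative):

```python
import random

def lsh_signature(vec, planes):
    """Sign pattern of the vector against each random hyperplane."""
    return tuple(1 if sum(v * p for v, p in zip(vec, plane)) >= 0 else 0
                 for plane in planes)

def candidate_buckets(vectors, n_planes=8, seed=0):
    """Group objects by LSH signature; only pairs sharing a bucket need an
    exact similarity computation, pruning the rest."""
    rnd = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    planes = [[rnd.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for name, vec in vectors.items():
        buckets.setdefault(lsh_signature(vec, planes), []).append(name)
    return [group for group in buckets.values() if len(group) > 1]
```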
Numerous theories and algorithms have been developed to solve vectorial data learning problems by
searching for the hypothesis that best fits the observed training sample. However, many real-world
applications involve samples that are not described as feature vectors, but as (dis)similarity data.
Converting vectorial data into (dis)similarity data is more easily performed than converting
(dis)similarity data into vectorial data. This study proposes a stochastic iterative distance
transformation model for similarity-based learning. The proposed model can be used to identify a
clear class boundary in data by modifying the (dis)similarities between examples. The experimental
results indicate that the performance of the proposed method is comparable with those of various
vector-based and proximity-based learning algorithms.
ETPL
DM - 005
A Similarity-Based Learning Algorithm Using Distance Transformation
Many mature term-based or pattern-based approaches have been used in the field of information filtering to
generate users' information needs from a collection of documents. A fundamental assumption for these
approaches is that the documents in the collection are all about one topic. However, in reality users' interests
can be diverse and the documents in the collection often involve multiple topics. Topic modelling, such as
Latent Dirichlet Allocation (LDA), was proposed to generate statistical models to represent multiple topics in
a collection of documents, and this has been widely utilized in the fields of machine learning and information
retrieval, etc. But its effectiveness in information filtering has not been so well explored. Patterns are always
thought to be more discriminative than single terms for describing documents. However, the enormous number of discovered patterns hinders their effective and efficient use in real applications; therefore, selecting the most discriminative and representative patterns from the huge number of discovered patterns becomes crucial. To deal with the above-mentioned limitations and problems, in this paper a novel
information filtering model, Maximum matched Pattern-based Topic Model (MPBTM), is proposed. The main
distinctive features of the proposed model include: (1) user information needs are generated in terms of
multiple topics; (2) each topic is represented by patterns; (3) patterns are generated from topic models and are
organized in terms of their statistical and taxonomic features; and (4) the most discriminative and
representative patterns, called Maximum Matched Patterns, are proposed to estimate the document relevance
to the user's information needs in order to filter out irrelevant documents. Extensive experiments are conducted
to evaluate the effectiveness of the proposed model by using the TREC data collection Reuters Corpus
Volume 1. The results show that the proposed model significantly outperforms both state-of-the-art term-based and pattern-based models.
ETPL
DM - 006
Pattern-based Topics for Document Modelling in Information Filtering
Learning to rank arises in many data mining applications, ranging from web search engine, online advertising
to recommendation system. In learning to rank, the performance of a ranking model is strongly affected by the
number of labeled examples in the training set; on the other hand, obtaining labeled examples for training data
is very expensive and time-consuming. This presents a great need for active learning approaches that select the most informative examples for learning to rank; however, there is still very limited work in the literature addressing active learning for ranking. In this paper, we propose a general active learning framework, expected
loss optimization (ELO), for ranking. The ELO framework is applicable to a wide range of ranking functions.
Under this framework, we derive a novel algorithm, expected discounted cumulative gain (DCG) loss
optimization (ELO-DCG), to select the most informative examples. Then, we investigate both query-level and document-level active learning for ranking and propose a two-stage ELO-DCG algorithm which incorporates both query and document selection into active learning. Furthermore, we show that the algorithm can flexibly deal with the skewed grade distribution problem by modifying the loss function.
Extensive experiments on real-world web search data sets have demonstrated great potential and effectiveness
of the proposed framework and algorithms.
ETPL
DM - 007
Active Learning for Ranking through Expected Loss Optimization
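Two ingredients of this entry can be illustrated in toy form: the DCG measure itself, and an uncertainty-style expected-loss score for deciding which example to label next. The paper's actual ELO-DCG derivation is more involved; the names and the loss formula below are illustrative simplifications:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(relevances))

def expected_gain_loss(grade_probs):
    """Toy expected-loss score for one document given a distribution over
    its relevance grades: the expected spread of its DCG gain contribution.
    Documents whose grade is uncertain score higher and get labeled first."""
    expected = sum(p * (2 ** g - 1) for g, p in grade_probs.items())
    return sum(p * abs((2 ** g - 1) - expected)
               for g, p in grade_probs.items())
```

A document whose grade is already certain contributes zero expected loss, so labeling it teaches the ranker nothing; that is the intuition the ELO framework formalizes.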
Due to its successful application in recommender systems, collaborative filtering (CF) has become a hot
research topic in data mining and information retrieval. In traditional CF methods, only the feedback matrix,
which contains either explicit feedback (also called ratings) or implicit feedback on the items given by users, is
used for training and prediction. Typically, the feedback matrix is sparse, which means that most users interact
with few items. Due to this sparsity problem, traditional CF with only feedback information will suffer from
unsatisfactory performance. Recently, many researchers have proposed to utilize auxiliary information, such
as item content (attributes), to alleviate the data sparsity problem in CF. Collaborative topic regression (CTR)
is one of these methods which has achieved promising performance by successfully integrating both feedback
information and item content information. In many real applications, besides the feedback and item content
information, there may exist relations (also known as networks) among the items which can be helpful for
recommendation. In this paper, we develop a novel hierarchical Bayesian model called Relational
Collaborative Topic Regression (RCTR), which extends CTR by seamlessly integrating the user-item
feedback information, item content information, and network structure among items into the same model.
Experiments on real-world datasets show that our model can achieve better prediction accuracy than state-of-the-art methods.
ETPL
DM - 008
Relational Collaborative Topic Regression for Recommender Systems
It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing
user preferences because of large scale terms and data patterns. Most existing popular text mining and
classification methods have adopted term-based approaches. However, they have all suffered from the
problems of polysemy and synonymy. Over the years, the hypothesis has often been held that pattern-based methods should perform better than term-based ones in describing user preferences; yet, how to
effectively use large scale patterns remains a hard problem in text mining. To make a breakthrough in this
challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both
positive and negative patterns in text documents as higher level features and deploys them over low-level
features (terms). It also classifies terms into categories and updates term weights based on their specificity and
their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics and Reuters-
21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods
and pattern-based methods.
ETPL
DM - 009
Relevance Feature Discovery for Text Mining
Given the proliferation of review content, and the fact that reviews are highly diverse and often unnecessarily
verbose, users frequently face the problem of selecting the appropriate reviews to consume. Micro-reviews are
emerging as a new type of online review content in the social media. Micro-reviews are posted by users of
check-in services such as Foursquare. They are concise (up to 200 characters long) and highly focused, in
contrast to the comprehensive and verbose reviews. In this paper, we propose a novel mining problem, which
brings together these two disparate sources of review content. Specifically, we use coverage of micro-reviews
as an objective for selecting a set of reviews that efficiently cover the salient aspects of an entity. Our
approach consists of a two-step process: matching review sentences to micro-reviews, and selecting a small set
of reviews that cover as many micro-reviews as possible, with few sentences. We formulate this objective as a
combinatorial optimization problem, and show how to derive an optimal solution using Integer Linear
Programming. We also propose an efficient heuristic algorithm that approximates the optimal solution.
Finally, we perform a detailed evaluation of all the steps of our methodology using data collected from
Foursquare and Yelp.
ETPL
DM - 010
Review Selection Using Micro-Reviews
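The selection step here is an instance of maximum coverage: the ILP gives the optimal answer, and the heuristic the paper mentions is typically greedy. A minimal sketch of that greedy heuristic, with illustrative names and data, could look like:

```python
def greedy_review_selection(reviews, k):
    """reviews: {review_id: set of micro-review ids its sentences match}.
    Greedily pick up to k reviews, each maximizing marginal coverage of
    so-far-uncovered micro-reviews (the classic max-coverage heuristic)."""
    chosen, covered = [], set()
    candidates = dict(reviews)
    for _ in range(k):
        best = max(candidates, key=lambda r: len(candidates[r] - covered),
                   default=None)
        if best is None or not candidates[best] - covered:
            break  # nothing left to gain
        chosen.append(best)
        covered |= candidates.pop(best)
    return chosen, covered
```

The greedy rule gives the well-known (1 - 1/e) approximation guarantee for coverage objectives, which is why it is a reasonable stand-in for the ILP when the review collection is large.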
Traditional sequential rule mining suffers from three problems: (1) similar rules can be rated quite differently, (2) rules may not be found because they are
individually considered uninteresting, and (3) rules that are too specific are less likely to be used for making
predictions. To address these issues, we explore the idea of mining “partially-ordered sequential rules”
(POSR), a more general form of sequential rules such that items in the antecedent and the consequent of each
rule are unordered. To mine POSR, we propose the RuleGrowth algorithm, which is efficient and easily
extendable. In particular, we present an extension (TRuleGrowth) that accepts a sliding-window constraint to
find rules occurring within a maximum amount of time. A performance study on four real-life datasets shows
that RuleGrowth and TRuleGrowth have excellent performance and scalability compared to baseline
algorithms and that the number of rules discovered can be several orders of magnitude smaller when the
sliding-window constraint is applied. Furthermore, we also report results from a real application showing that
POSR can provide a much higher prediction accuracy than regular sequential rules for sequence prediction.
ETPL
DM - 011
Mining Partially-Ordered Sequential Rules Common to Multiple Sequences
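A partially-ordered rule X -> Y holds in a sequence when every item of X appears, in any order, before every item of Y. The sketch below computes support and confidence of one such rule; it is an illustration of the rule semantics only, not RuleGrowth's pattern-growth search, and all names are illustrative:

```python
def posr_stats(sequences, antecedent, consequent):
    """Support and confidence of the partially-ordered rule X -> Y."""
    ant, cons = set(antecedent), set(consequent)
    n_ant = n_rule = 0
    for seq in sequences:  # seq: list of itemsets (sets of items)
        first_pos = {}
        for i, itemset in enumerate(seq):
            for item in itemset:
                first_pos.setdefault(item, i)
        if not ant <= first_pos.keys():
            continue  # antecedent absent from this sequence
        n_ant += 1
        cut = max(first_pos[item] for item in ant)  # X complete here
        seen_after = set().union(*seq[cut + 1:])
        if cons <= seen_after:
            n_rule += 1
    sup = n_rule / len(sequences)
    conf = n_rule / n_ant if n_ant else 0.0
    return sup, conf
```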
The problem of mobile sequential recommendation is to suggest a route connecting a set of pick-up points for
a taxi driver so that he/she is more likely to get passengers with less travel cost. Essentially, a key challenge of
this problem is its high computational complexity. In this paper, we propose a novel dynamic programming
based method to solve the mobile sequential recommendation problem consisting of two separate stages: an
offline pre-processing stage and an online search stage. The offline stage pre-computes potential candidate
sequences from a set of pick-up points. A backward incremental sequence generation algorithm is proposed
based on the identified iterative property of the cost function. Simultaneously, an incremental pruning policy is
adopted in the process of sequence generation to reduce the search space of the potential sequences
effectively. In addition, a batch pruning algorithm is further applied to the generated potential sequences to
remove some non-optimal sequences of a given length. Since the pruning effectiveness keeps growing with the
increase of the sequence length, at the online stage, our method can efficiently find the optimal driving route
for an unloaded taxi in the remaining candidate sequences. Moreover, our method can handle the problem of
optimal route search with a maximum cruising distance or a destination constraint. Experimental results on
real and synthetic data sets show that both the pruning ability and the efficiency of our method surpass the
state-of-the-art methods. Our techniques can therefore be effectively employed to address the problem of
mobile sequential recommendation with many pick-up points in real-world applications.
ETPL
DM - 012
Backward Path Growth for Efficient Mobile Sequential Recommendation
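At its core the offline stage searches over orderings of pick-up points under a cost bound. The exhaustive baseline below shows the two pruning ideas in toy form: cutting a partial route once it exceeds the best cost so far or the maximum cruising distance. The paper's backward incremental generation and batch pruning replace this brute force; names here are illustrative:

```python
import math
from itertools import permutations

def best_route(start, points, max_dist=float("inf")):
    """Try every pick-up order, pruning a partial route once its cost
    exceeds the best found so far or the maximum cruising distance."""
    best_order, best_cost = None, float("inf")
    for order in permutations(points):
        cost, prev, pruned = 0.0, start, False
        for p in order:
            cost += math.dist(prev, p)
            if cost >= best_cost or cost > max_dist:
                pruned = True
                break
            prev = p
        if not pruned and cost < best_cost:
            best_order, best_cost = list(order), cost
    return best_order, best_cost
```

Even in this toy form, most permutations are abandoned after a prefix, which is the effect the paper's offline pruning scales up to many pick-up points.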
Mining opinion targets and opinion words from online reviews are important tasks for fine-grained opinion
mining, the key component of which involves detecting opinion relations among words. To this end, this paper
proposes a novel approach based on the partially-supervised alignment model, which regards identifying
opinion relations as an alignment process. Then, a graph-based co-ranking algorithm is exploited to estimate
the confidence of each candidate. Finally, candidates with higher confidence are extracted as opinion targets or
opinion words. Compared to previous methods based on the nearest-neighbor rules, our model captures
opinion relations more precisely, especially for long-span relations. Compared to syntax-based methods, our
word alignment model effectively alleviates the negative effects of parsing errors when dealing with informal
online texts. In particular, compared to the traditional unsupervised alignment model, the proposed model
obtains better precision because of the usage of partial supervision. In addition, when estimating candidate
confidence, we penalize higher-degree vertices in our graph-based co-ranking algorithm to decrease the
probability of error generation. Our experimental results on three corpora with different sizes and languages
show that our approach effectively outperforms state-of-the-art methods.
ETPL
DM - 013
Co-Extracting Opinion Targets and Opinion Words from Online Reviews Based
on the Word Alignment Model
Feature selection has been an important research topic in data mining, because real data sets often have high-dimensional features, such as in bioinformatics and text mining applications. Many existing filter feature selection methods rank features by optimizing certain feature ranking criteria, such that correlated features
often have similar rankings. These correlated features are redundant and do not provide much additional mutual information to help data mining. Thus, when we select a limited number of features, we hope to select the top non-redundant features such that the useful mutual information can be maximized. In previous research, Ding
et al. recognized this important issue and proposed the mRMR (minimum Redundancy Maximum Relevance
Feature Selection) model to minimize the redundancy between sequentially selected features. However, this
method used a greedy search; thus, the global feature redundancy was not considered and the results are not
optimal. In this paper, we propose a new feature selection framework to globally minimize the feature
redundancy with maximizing the given feature ranking scores, which can come from any supervised or
unsupervised method. Our new model has no parameters, which makes it especially suitable for practical data mining applications. Experimental results on benchmark data sets show that the proposed method consistently
improves the feature selection results compared to the original methods. Meanwhile, we introduce a new
unsupervised global and local discriminative feature selection method which can be unified with the global
feature redundancy minimization framework and shows superior performance.
ETPL
DM - 014
Global Redundancy Minimization for Feature Ranking
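For contrast with the paper's global formulation, here is the greedy mRMR-style baseline it improves on, in toy form: pick features by relevance minus mean absolute correlation with those already picked. The use of Pearson correlation as the redundancy measure and all names are illustrative:

```python
def pearson(x, y):
    """Pearson correlation between two equal-length value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def greedy_mrmr(features, scores, k):
    """features: {name: value list}; scores: {name: relevance score}.
    Greedy selection: relevance minus mean |correlation| with the
    already-selected features (the sequential step the paper replaces
    with a global redundancy minimization)."""
    chosen = []
    while len(chosen) < k:
        def criterion(f):
            if not chosen:
                return scores[f]
            red = sum(abs(pearson(features[f], features[g])) for g in chosen)
            return scores[f] - red / len(chosen)
        chosen.append(max((f for f in features if f not in chosen),
                          key=criterion))
    return chosen
```

Note how a highly-ranked but perfectly correlated feature loses to a less relevant yet non-redundant one; the paper's point is that making this trade-off globally, rather than one feature at a time, gives better results.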
There is an intense technological race underway to build the highest-performance and lowest-power custom
Bitcoin mining appliances using custom ASIC processors. This article describes the architecture and
implementation details of CoinTerra's first-generation Bitcoin mining processor, Goldstrike 1, and how this
processor is used to design a complete Bitcoin mining machine called Terraminer IV. Because of high power
density in the Bitcoin mining processor, delivering power and cooling the die posed enormous challenges.
This article describes some of the solutions adopted to overcome these challenges.
ETPL
DM - 015
Goldstrike 1: CoinTerra's First-Generation Cryptocurrency Mining Processor
for Bitcoin
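The computation these ASICs race on is simply double SHA-256 of the block header, compared against a difficulty target; a minimal software sketch of that kernel (header contents and names are illustrative):

```python
import hashlib

def block_hash(header):
    """Bitcoin's proof-of-work function: SHA-256 applied twice to the
    serialized block header."""
    return hashlib.sha256(hashlib.sha256(header).digest()).digest()

def meets_target(header, target):
    """A header wins when its hash, read as a little-endian integer, falls
    below the difficulty target; miners iterate the header's nonce field
    until this holds."""
    return int.from_bytes(block_hash(header), "little") < target
```

A chip like Goldstrike 1 implements thousands of pipelined copies of exactly this double-SHA-256 datapath, which is what creates the power-density problem the article describes.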
In this study, the performance of two waterline extraction approaches is analyzed using dual-polarization
Cosmo-SkyMed (CSK) Synthetic Aperture Radar (SAR) data and ancillary ground truth information. The
single-polarization approach is based on multiscale normalized cuts segmentation, while the dual-polarization
one exploits the inherent peculiarities of the CSK PING PONG incoherent dual-polarimetric imaging mode
together with a tailored scattering model to perform land/sea discrimination. The two approaches are applied
to the actual CSK SAR data collected over the coastal area of Shanghai, China. To provide a detailed and
complete validation of the two approaches, we carried out several field surveys collecting in situ ancillary
information including Global Positioning System (GPS) data and tidal information. Experimental results show
that 1) both approaches provide satisfactory results in extracting waterline from CSK SAR data in the
intertidal flat under low-to-moderate wind conditions and under a very broad range of incidence angles; 2) the
accuracy of the waterline extracted by both approaches decreases in case of water within the intertidal flat; 3)
the single-polarization approach is unsupervised when the land/sea contrast ratio is high. However, it needs
manual supervision to correct the extracted waterline when the land/sea contrast is low or in complex areas. A
typical CSK scene is processed in about 25 min; 4) the dual-polarization approach is unsupervised and very
effective: a typical CSK SAR scene is processed in seconds.
ETPL
DM - 016
Performance Analysis and Validation of Waterline Extraction Approaches
Using Single- and Dual-Polarimetric SAR Data
Current tools that facilitate the extract-transform-load (ETL) process focus on ETL workflow, not on
generating meaningful semantic relationships to integrate data from multiple, heterogeneous sources. A
proposed semantic ETL framework applies semantics to various data fields and so allows richer data
integration.
ETPL
DM - 017
Integrating Big Data: A Semantic Extract-Transform-Load Framework
A description of patient conditions should consist of the changes in and combination of clinical
measures. Traditional data-processing methods and classification algorithms might cause clinical
information to disappear and reduce prediction performance. To improve the accuracy of clinical-
outcome prediction by using multiple measurements, a new multiple-time-series data-processing
algorithm with period merging is proposed. Clinical data from 83 hepatocellular carcinoma (HCC)
patients were used in this research. Their clinical reports from a defined period were merged using the
proposed merging algorithm, and statistical measures were also calculated. After data processing,
multiple measurements support vector machine (MMSVM) with radial basis function (RBF) kernels
was used as a classification method to predict HCC recurrence. A multiple measurements random
forest regression (MMRF) was also used as an additional evaluation/classification method. To
evaluate the data-merging algorithm, the performance of prediction using processed multiple
measurements was compared to prediction using single measurements. The results of recurrence
prediction by MMSVM with RBF using multiple measurements and a period of 120 days (accuracy
0.771, balanced accuracy 0.603) were optimal, and their superiority to the results obtained using single
measurements was statistically significant (accuracy 0.626, balanced accuracy 0.459, P < 0.01). In the
cases of MMRF, the prediction results obtained after applying the proposed merging algorithm were
also better than single-measurement results (P < 0.05). The results show that the performance of
HCC-recurrence prediction was significantly improved when the proposed data-processing algorithm
was used, and that multiple measurements could be of greater value than single measurements.
ETPL
DM - 018
Multiple-Time-Series Clinical Data Processing for Classification With Merging
Algorithm and Statistical Measures
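The merging step can be sketched simply: group each patient's measurements into fixed-length periods and summarize every period with statistical measures (the 120-day period echoes the paper's best setting; the exact statistics and names here are illustrative):

```python
from statistics import mean, pstdev

def merge_periods(records, period_days):
    """records: list of (day_offset, value) measurements for one patient.
    Merge measurements falling into the same fixed-length period and
    summarize each period with simple statistical measures."""
    bins = {}
    for day, value in records:
        bins.setdefault(day // period_days, []).append(value)
    return {p: {"mean": mean(vals), "sd": pstdev(vals), "n": len(vals)}
            for p, vals in sorted(bins.items())}
```

The summarized periods, rather than raw single measurements, then become the inputs to the SVM or random forest classifier.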
The increasing complexity and volume of data mandate tighter integration between analytics and visualization.
In this paper, I propose a pattern for integrating computational and visual analytics techniques, called just-in-time (JIT) interactive analytics. JIT analytics is performed in real time on the data users are interacting with, to
guide visual-analytic exploration. Fundamental to JIT analytics is enriching visualizations with annotations
that describe semantics of visual features, thereby suggesting to users possible insights to examine further. To
accomplish this, JIT analytics needs to 1) identify insights depicted as visual patterns such as clusters, outliers,
and trends in visualizations and 2) determine the semantics of such features by considering not only attributes
that are being visualized but also other attributes in data. In this paper, I describe the JIT interactive analytics
pattern, along with a generic implementation for any type of visualization and data, and provide a particular
implementation for point-based visualization of multivariate data. I argue that the pattern provides a useful
user experience by elevating the cognitive level of interaction with data from pure perception of visual
representations to understanding higher level semantics of data. As such, this supports users in building faster
qualitative mental models and accelerating discovery. Furthermore, facilitating insight opens new research
opportunities such as visual-analytic action recommendations, improved collaboration, and accessibility.
ETPL
DM - 019
Just-in-time interactive analytics: Guiding visual exploration of data
In this paper, we propose a detection method based on data-driven target modeling, which implicitly handles
variations in the target appearance. Given a training set of images of the target, our approach constructs models
based on local neighborhoods within the training set. We present a new metric using these models and show
that, by controlling the notion of locality within the training set, this metric is invariant to perturbations in the
appearance of the target. Using this metric in a supervised graph framework, we construct a low-dimensional
embedding of test images. Then, a detection score based on the embedding determines the presence of a target
in each image. The method is applied to a data set of side-scan sonar images and achieves impressive results in
the detection of sea mines. The proposed framework is general and can be applied to different target detection
problems in a broad range of signals.
ETPL
DM - 020
Graph-Based Supervised Automatic Target Detection
Software companies spend over 45 percent of their costs on dealing with software bugs. An inevitable step of fixing
bugs is bug triage, which aims to correctly assign a developer to a new bug. To decrease the time cost in
manual work, text classification techniques are applied to conduct automatic bug triage. In this paper, we
address the problem of data reduction for bug triage, i.e., how to reduce the scale and improve the quality of
bug data. We combine instance selection with feature selection to simultaneously reduce data scale on the bug
dimension and the word dimension. To determine the order of applying instance selection and feature
selection, we extract attributes from historical bug data sets and build a predictive model for a new bug data
set. We empirically investigate the performance of data reduction on a total of 600,000 bug reports from two large open source projects, namely Eclipse and Mozilla. The results show that our data reduction can effectively reduce the data scale and improve the accuracy of bug triage. Our work provides an approach to leveraging
techniques on data processing to form reduced and high-quality bug data in software development and
maintenance.
ETPL
DM - 021
Towards Effective Bug Triage with Software Data Reduction Techniques
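The two reduction dimensions can be sketched in toy form: feature selection shrinks the word dimension, instance selection shrinks the bug dimension, and the order of application is itself a choice (which the paper predicts from historical data). The word-frequency selector and empty-report filter below are deliberately naive stand-ins, and all names are illustrative:

```python
from collections import Counter

def feature_select(reports, top_n):
    """Keep only the top_n most frequent words (the word dimension)."""
    freq = Counter(word for words, _ in reports for word in words)
    keep = {word for word, _ in freq.most_common(top_n)}
    return [([w for w in words if w in keep], dev) for words, dev in reports]

def instance_select(reports):
    """Drop reports left without words (a naive stand-in for real
    instance selection on the bug dimension)."""
    return [(words, dev) for words, dev in reports if words]

def reduce_bug_data(reports, top_n, fs_first=True):
    """reports: list of (word list, assigned developer). The paper trains
    a predictive model to choose the application order; here it is a flag."""
    if fs_first:
        return instance_select(feature_select(reports, top_n))
    return feature_select(instance_select(reports), top_n)
```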
Intelligently extracting knowledge from social media has recently attracted great interest from the Biomedical
and Health Informatics community to simultaneously improve healthcare outcomes and reduce costs using
consumer-generated opinion. We propose a two-step analysis framework that focuses on positive and negative
sentiment, as well as the side effects of treatment, in users' forum posts, and identifies user communities
(modules) and influential users for the purpose of ascertaining user opinion of cancer treatment. We used a
self-organizing map to analyze word frequency data derived from users' forum posts. We then introduced a
novel network-based approach for modeling users' forum interactions and employed a network partitioning
method based on optimizing a stability quality measure. This allowed us to determine consumer opinion and
identify influential users within the retrieved modules using information derived from both word-frequency
data and network-based properties. Our approach can expand research into intelligently mining social media
data for consumer opinion of various treatments to provide rapid, up-to-date information for the
pharmaceutical industry, hospitals, and medical staff, on the effectiveness (or ineffectiveness) of future
treatments.
ETPL
DM - 022
Network-Based Modeling and Intelligent Data Mining of Social Media for
Improving Care
We propose a novel approach for detecting precursors to epileptic seizures in intracranial
electroencephalograms (iEEGs), which is based on the analysis of system dynamics. In the proposed scheme,
the largest Lyapunov exponents (LLEs) of the wavelet entropy of the segmented EEG signals are considered as the
discriminating features. Such features are processed by a support vector machine classifier, whose outcomes
(the label and its probability for each LLE) are post-processed and fed into a novel decision function to
determine whether the corresponding segment of the EEG signal contains a precursor to an epileptic seizure.
The proposed scheme is applied to the Freiburg data set, and the results show that seizure precursors are
detected in a time frame that, unlike in other existing schemes, is convenient for patients, with a
sensitivity of 100% and a negligible false-positive detection rate.
ETPL
DM - 023
Real-time mining of epileptic seizure precursors via nonlinear mapping and
dissimilarity features
Data mining has wide applications in many areas, such as banking, medicine, scientific research, and
government agencies. Classification is one of the commonly used tasks in data mining applications. For the past
decade, due to the rise of various privacy issues, many theoretical and practical solutions to the classification
problem have been proposed under different security models. However, with the recent popularity of cloud
computing, users now have the opportunity to outsource their data, in encrypted form, as well as the data mining
tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preserving classification
techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data.
In particular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol
protects the confidentiality of the data and the privacy of the user's input query, and hides the data access patterns. To the best
of our knowledge, our work is the first to develop a secure k-NN classifier over encrypted data under the semi-
honest model. Also, we empirically analyze the efficiency of our proposed protocol using a real-world dataset
under different parameter settings.
ETPL
DM - 024
k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational
Data
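The classification task the protocol secures is ordinary k-NN; a plain, unencrypted sketch with hypothetical data (the paper's contribution is running the same decision over encrypted records, which this sketch does not show):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training records nearest to the query
    point, using Euclidean distance."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _point, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b")]
print(knn_classify(train, (0.2, 0.1)))  # → "a"
```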
As a newly emerging network model, heterogeneous information networks (HINs) have received growing
attention. Many data mining tasks have been explored in HINs, including clustering, classification, and
similarity search. Similarity join is a fundamental operation required for many problems. It is attracting
attention from various applications on network data, such as friend recommendation, link prediction, and
online advertising. Although similarity join has been well studied in homogeneous networks, it has not yet
been studied in heterogeneous networks. In particular, none of the existing research on similarity join takes the
different semantic meanings behind paths into consideration, and almost all of it completely ignores the
heterogeneity and diversity of HINs. In this paper, we propose a path-based similarity join (PS-join)
method to return the top k similar pairs of objects based on any user specified join path in a heterogeneous
information network. We study how to prune expensive similarity computation by introducing bucket pruning
based locality sensitive hashing (BPLSH) indexing. Compared with the existing Link-based Similarity join
(LS-join) method, PS-join can derive various similarity semantics. Experimental results on real data sets show the
efficiency and effectiveness of the proposed approach.
ETPL
DM - 025
Top-k Similarity Join in Heterogeneous Information Networks
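The bucket-pruning idea can be illustrated with a minimal random-hyperplane LSH sketch (an assumed generic variant, not the paper's exact BPLSH indexing): vectors that share a signature land in the same bucket, and similarity is computed only within buckets.

```python
import random

def lsh_buckets(vectors, n_planes=4, seed=0):
    """Hash each vector to a signature of random-hyperplane signs;
    candidate pairs for similarity computation are drawn only from
    vectors sharing a bucket, pruning most pairwise comparisons."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]
    buckets = {}
    for name, v in vectors.items():
        sig = tuple(sum(p[i] * v[i] for i in range(dim)) >= 0 for p in planes)
        buckets.setdefault(sig, []).append(name)
    return buckets

vectors = {"u1": [1.0, 0.0], "u2": [0.9, 0.1], "u3": [-1.0, 0.0]}
for sig, members in lsh_buckets(vectors).items():
    print(sig, members)
```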
Ranking of association rules is currently an interesting topic in data mining and bioinformatics. The huge
number of rules over items (or genes) produced by association rule mining (ARM) algorithms confuses
the decision maker. In this article, we propose a weighted rule-mining technique (say, RANWAR or rank-
based weighted association rule-mining) to rank the rules using two novel rule-interestingness measures, viz.,
rank-based weighted condensed support (wcs) and weighted condensed confidence (wcc) measures to bypass
the problem. These measures basically depend on the rank of items (genes). Using the rank, we assign a
weight to each item. RANWAR generates far fewer frequent itemsets than the state-of-the-art
association rule mining algorithms and thus saves execution time. We run RANWAR on
gene expression and methylation datasets. The genes of the top rules are biologically validated by Gene
Ontologies (GOs) and KEGG pathway analyses. Many top-ranked rules extracted by RANWAR that hold
poor ranks in traditional Apriori are highly biologically significant to the related diseases. Finally, the top
rules found by RANWAR that are not found by Apriori are reported.
ETPL
DM - 026
RANWAR: Rank-Based Weighted Association Rule Mining From Gene
Expression and Methylation Data
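As a rough illustration of how item ranks can bias a support measure (a hedged sketch only; the paper's exact wcs/wcc definitions differ):

```python
def weighted_support(itemset, transactions, weight):
    """Toy rank-weighted support: the plain support of an itemset is
    scaled by the mean weight of its items, so itemsets of highly ranked
    (heavily weighted) genes score higher at equal frequency."""
    containing = [t for t in transactions if itemset <= t]
    if not containing:
        return 0.0
    mean_w = sum(weight[i] for i in itemset) / len(itemset)
    return mean_w * len(containing) / len(transactions)

transactions = [{"g1", "g2"}, {"g1", "g2", "g3"}, {"g3"}]
weight = {"g1": 1.0, "g2": 0.5, "g3": 0.2}  # derived from gene ranks
print(weighted_support({"g1", "g2"}, transactions, weight))  # 0.75 * 2/3 = 0.5
```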
In this paper, a new method for constructing decision trees for stream data is proposed. First a new splitting
criterion based on the misclassification error is derived. A theorem is proven showing that the best attribute
computed in the considered node according to the available data sample is, with high probability, the same
as the attribute derived from the whole infinite data stream. Next, this result is combined with the splitting
criterion based on the Gini index. It is shown that such combination provides the highest accuracy among all
studied algorithms.
ETPL
DM - 027
A New Method for Data Stream Mining Based on the Misclassification Error
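The two node impurities being combined are standard and can be computed directly (these are the textbook definitions, not the paper's derived stream splitting criterion itself):

```python
def misclassification_error(counts):
    """Node impurity = 1 - fraction of the majority class."""
    n = sum(counts)
    return 1.0 - max(counts) / n

def gini(counts):
    """Gini index = 1 - sum of squared class fractions."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# A node holding 8 examples of one class and 2 of another:
print(misclassification_error([8, 2]))  # 0.2
print(gini([8, 2]))                     # 1 - 0.64 - 0.04 = 0.32
```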
Learning to rank arises in many data mining applications, ranging from web search engines and online advertising
to recommendation systems. In learning to rank, the performance of a ranking model is strongly affected by the
number of labeled examples in the training set; on the other hand, obtaining labeled examples for training data
is very expensive and time-consuming. This presents a great need for active learning approaches that select the
most informative examples for learning to rank; however, there is still very limited work in the literature that
addresses active learning for ranking. In this paper, we propose a general active learning framework, expected
loss optimization (ELO), for ranking. The ELO framework is applicable to a wide range of ranking functions.
Under this framework, we derive a novel algorithm, expected discounted cumulative gain (DCG) loss
optimization (ELO-DCG), to select the most informative examples. Then, we investigate both query- and
document-level active learning for ranking and propose a two-stage ELO-DCG algorithm which incorporates
both query and document selection into active learning. Furthermore, we show that the algorithm can flexibly
deal with the skewed grade distribution problem by modifying the loss function.
Extensive experiments on real-world web search data sets have demonstrated the great potential and
effectiveness of the proposed framework and algorithms.
ETPL
DM - 028
Active Learning for Ranking through Expected Loss Optimization
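DCG, the quantity whose expected loss ELO-DCG optimizes, can be sketched together with a simplified expected-loss score (the expectation here is a naive average over a few candidate rankings, an assumption for illustration, not the paper's exact estimator):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of relevance grades."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))

def expected_dcg_loss(sampled_rankings):
    """Average DCG shortfall of sampled rankings versus the best sampled
    ranking -- a naive stand-in for the expected-loss criterion: examples
    whose candidate rankings disagree most are the most informative."""
    best = max(dcg(r) for r in sampled_rankings)
    return sum(best - dcg(r) for r in sampled_rankings) / len(sampled_rankings)

print(round(dcg([3, 2, 0]), 4))                   # ≈ 8.8928
print(expected_dcg_loss([[3, 2, 0], [2, 3, 0]]))  # > 0: rankings disagree
```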
It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing
user preferences because of the large number of terms and data patterns. Most existing popular text mining and
classification methods have adopted term-based approaches. However, they all suffer from the
problems of polysemy and synonymy. Over the years, the hypothesis has often been held that pattern-
based methods should perform better than term-based ones in describing user preferences; yet, how to
effectively use large-scale patterns remains a hard problem in text mining. To make a breakthrough on this
challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both
positive and negative patterns in text documents as higher level features and deploys them over low-level
features (terms). It also classifies terms into categories and updates term weights based on their specificity and
their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics and Reuters-
21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods
and the pattern-based methods.
ETPL
DM - 030
Relevance Feature Discovery for Text Mining
The classical fuzzy system modeling methods implicitly assume data generated from a single task, which is
essentially not in accordance with many practical scenarios where data can be acquired from the perspective of
multiple tasks. Although one can build an individual fuzzy system model for each task, such an individual
modeling approach yields poor generalization ability because it ignores the hidden intertask
correlation. In order to circumvent this shortcoming, we consider a general framework for preserving
the independent information among different tasks and mining hidden correlation information among all tasks
in multitask fuzzy modeling. In this framework, a low-dimensional subspace (structure) is assumed to be
shared among all tasks and hence be the hidden correlation information among all tasks. Under this
framework, a multitask Takagi-Sugeno-Kang (TSK) fuzzy system model called MTCS-TSK-FS (TSK-FS for
multiple tasks with common hidden structure), based on the classical L2-norm TSK fuzzy system, is proposed
in this paper. The proposed model can not only take advantage of independent sample information from the
original space for each task, but also effectively use the intertask common hidden structure among multiple
tasks to enhance the generalization performance of the built fuzzy systems. Experiments on synthetic and real-
world datasets demonstrate the applicability and distinctive performance of the proposed multitask fuzzy
system model in multitask regression learning scenarios.
ETPL
DM - 031
Multitask TSK Fuzzy System Modeling by Mining Intertask Common Hidden
Structure
The problem of mobile sequential recommendation is to suggest a route connecting a set of pick-up points for
a taxi driver so that he/she is more likely to pick up passengers at a lower travel cost. Essentially, a key challenge of
this problem is its high computational complexity. In this paper, we propose a novel dynamic programming
based method to solve the mobile sequential recommendation problem consisting of two separate stages: an
offline pre-processing stage and an online search stage. The offline stage pre-computes potential candidate
sequences from a set of pick-up points. A backward incremental sequence generation algorithm is proposed
based on the identified iterative property of the cost function. Simultaneously, an incremental pruning policy is
adopted in the process of sequence generation to reduce the search space of the potential sequences effectively.
In addition, a batch pruning algorithm is further applied to the generated potential sequences to remove some
non-optimal sequences of a given length. Since the pruning effectiveness keeps growing with the increase of
the sequence length, at the online stage, our method can efficiently find the optimal driving route for an
unloaded taxi in the remaining candidate sequences. Moreover, our method can handle the problem of optimal
route search with a maximum cruising distance or a destination constraint. Experimental results on real and
synthetic data sets show that both the pruning ability and the efficiency of our method surpass the state-of-the-
art methods. Our techniques can therefore be effectively employed to address the problem of mobile sequential
recommendation with many pick-up points in real-world applications.
ETPL
DM - 032
Backward Path Growth for Efficient Mobile Sequential Recommendation
Feature selection has been an important research topic in data mining, because real data sets often have
high-dimensional features, as in bioinformatics and text mining applications. Many existing filter feature
selection methods rank features by optimizing certain feature-ranking criteria, so that correlated features
often have similar rankings. Such correlated features are redundant and provide little additional mutual
information to help data mining. Thus, when we select a limited number of features, we hope to select the top
non-redundant features so that the useful mutual information is maximized. In previous research, Ding
et al. recognized this important issue and proposed the mRMR (minimum Redundancy Maximum Relevance
Feature Selection) model to minimize the redundancy between sequentially selected features. However, this
method uses a greedy search, so the global feature redundancy is not considered and the results are not
optimal. In this paper, we propose a new feature selection framework to globally minimize the feature
redundancy while maximizing the given feature-ranking scores, which can come from any supervised or
unsupervised method. Our new model has no parameters, which makes it especially suitable for practical data
mining applications.
ETPL
DM - 033
Global Redundancy Minimization for Feature Ranking
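The greedy mRMR-style baseline that the paper improves on can be sketched as follows (hypothetical inputs; the paper's contribution is a global, parameter-free formulation, which this greedy loop deliberately does not implement):

```python
def select_features(scores, redundancy, k):
    """Greedy relevance-minus-redundancy selection: start from the
    top-scored feature, then repeatedly add the feature whose score minus
    its average redundancy with the already-selected set is largest."""
    selected = [max(range(len(scores)), key=lambda i: scores[i])]
    while len(selected) < k:
        remaining = [i for i in range(len(scores)) if i not in selected]
        def gain(i):
            return scores[i] - sum(redundancy[i][j] for j in selected) / len(selected)
        selected.append(max(remaining, key=gain))
    return selected

scores = [0.9, 0.85, 0.3]                      # per-feature ranking scores
redundancy = [[0.0, 0.8, 0.1],                 # features 0 and 1 are highly redundant
              [0.8, 0.0, 0.1],
              [0.1, 0.1, 0.0]]
print(select_features(scores, redundancy, 2))  # [0, 2]: feature 1 is skipped
```

Because the choice at each step depends only on features already selected, the greedy search can miss globally better subsets, which is exactly the limitation the paper's global formulation addresses.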
Data mining has wide applications in many areas, such as banking, medicine, scientific research, and
government agencies. Classification is one of the commonly used tasks in data mining applications. For the
past decade, due to the rise of various privacy issues, many theoretical and practical solutions to the
classification problem have been proposed under different security models. However, with the recent
popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted form, as
well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-
preserving classification techniques are not applicable. In this paper, we focus on solving the classification
problem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the
cloud. The proposed protocol protects the confidentiality of the data and the privacy of the user's input query, and hides the
data access patterns. To the best of our knowledge, our work is the first to develop a secure k-NN classifier
over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed
protocol using a real-world dataset under different parameter settings.
ETPL
DM - 035
Secure k -NN Classification over Semantically Secure Encrypted Relational
Data
Recently, two ideas have been explored that lead to more accurate algorithms for time-series classification
(TSC). First, it has been shown that the simplest way to gain improvement on TSC problems is to transform
the data into an alternative space where discriminatory features are more easily detected. Second, it was
demonstrated that with a single data representation, improved accuracy can be achieved through simple
ensemble schemes. We combine these two principles to test the hypothesis that forming a collective of
ensembles of classifiers on different data transformations improves the accuracy of time-series classification.
The collective contains classifiers constructed in the time, frequency, change, and shapelet transformation
domains. For the time domain we use a set of elastic distance measures. For the other domains we use a range
of standard classifiers. Through extensive experimentation on 72 datasets, including all of the 46 UCR
datasets, we demonstrate that the simple collective formed by including all classifiers in one ensemble is
significantly more accurate than any of its components and any other previously published TSC algorithm. We
investigate alternative hierarchical collective structures and demonstrate the utility of the approach on a new
problem involving classifying Caenorhabditis elegans mutant types.
ETPL
DM - 036
Time-Series Classification with COTE: The Collective of Transformation-Based
Ensembles
Recommender systems are promising for providing personalized services. Collaborative filtering (CF)
technologies, which predict users' preferences based on their previous behaviors, have become one of
the most successful techniques to build modern recommender systems. Several challenging issues occur in
previously proposed CF methods: 1) most CF methods ignore users’ response patterns and may yield biased
parameter estimation and suboptimal performance; 2) some CF methods adopt heuristic weight settings, which
lack a systematic implementation; 3) the multinomial mixture models may weaken the computational
ability of matrix factorization for generating the data matrix, thus increasing the computational cost of
training. To resolve these issues, we incorporate users’ response models into the probabilistic matrix
factorization (PMF), a popular matrix factorization CF model, to establish the Response Aware Probabilistic
Matrix Factorization (RAPMF) framework. More specifically, we model the user response as a Bernoulli
distribution parameterized by the rating scores for the observed ratings, and as a step function for the
unobserved ratings. Moreover, we speed up the algorithm with a mini-batch implementation and
a crafted scheduling policy. Finally, we design different experimental protocols and conduct a systematic
empirical evaluation on both synthetic and real-world datasets to demonstrate the merits of the proposed
RAPMF and its mini-batch implementation.
ETPL
DM - 037
Boosting Response Aware Model-Based Collaborative Filtering
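Plain PMF, the model that RAPMF extends, can be trained with SGD as in this minimal sketch (all data and names are illustrative; the Bernoulli/step response model that distinguishes RAPMF is omitted):

```python
import random

def train_pmf(ratings, n_users, n_items, k=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Probabilistic matrix factorization via SGD on observed
    (user, item, rating) triples: learn latent factor matrices U and V
    so that U[u] . V[i] approximates rating r, with L2 regularization."""
    rng = random.Random(seed)
    U = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(U[u][f] * V[i][f] for f in range(k))
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)  # gradient step on user factor
                V[i][f] += lr * (err * uf - reg * vf)  # gradient step on item factor
    return U, V

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 5.0), (1, 1, 1.0)]
U, V = train_pmf(ratings, n_users=2, n_items=2)
print(sum(U[0][f] * V[0][f] for f in range(2)))  # should approach 5.0
```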
Over the past decade or so, several research groups have addressed the problem of multi-label classification
where each example can belong to more than one class at the same time. A common approach, called Binary
Relevance (BR), addresses this problem by inducing a separate classifier for each class. Research has shown
that this framework can be improved if mutual class dependence is exploited: an example that belongs to class
X is likely to belong also to class Y; conversely, belonging to X can make an example less likely to belong to
Z. Several works sought to model this information by using the vector of class labels as additional example
attributes. To fill the unknown values of these attributes during prediction, existing methods resort to using
outputs of other classifiers, and this makes them prone to errors. This is where our paper wants to contribute.
We identified two potential ways to prune unnecessary dependencies and to reduce error-propagation in our
new classifier-stacking technique, which is named PruDent. Experimental results indicate that the
classification performance of PruDent compares favorably with that of other state-of-the-art approaches over a
broad range of testbeds. Moreover, its computational costs grow only linearly in the number of classes.
ETPL
DM - 038
PruDent: A Pruned and Confident Stacking Approach for Multi-label
Classification
Recently, there has been a growing interest in designing differentially private data mining algorithms.
Frequent itemset mining (FIM) is one of the most fundamental problems in data mining. In this paper, we
explore the possibility of designing a differentially private FIM algorithm which can not only achieve high
data utility and a high degree of privacy, but also offer high time efficiency. To this end, we propose a
differentially private FIM algorithm based on the FP-growth algorithm, which is referred to as PFP-growth.
The PFP-growth algorithm consists of a preprocessing phase and a mining phase. In the preprocessing phase,
to improve the utility and privacy tradeoff, a novel smart splitting method is proposed to transform the
database. For a given database, the preprocessing phase needs to be performed only once. In the mining phase,
to offset the information loss caused by transaction splitting, we devise a run-time estimation method to
estimate the actual support of itemsets in the original database. In addition, by leveraging the downward
closure property, we put forward a dynamic reduction method to dynamically reduce the amount of noise
added to guarantee privacy during the mining process. Through formal privacy analysis, we show that our
PFP-growth algorithm is ε-differentially private. Extensive experiments on real datasets illustrate that our
PFP-growth algorithm substantially outperforms the state-of-the-art techniques.
ETPL
DM - 039
Differentially Private Frequent Itemset Mining via Transaction Splitting
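The basic noise-addition building block of differentially private support counting can be sketched as a generic Laplace mechanism (an assumed simplification; PFP-growth's smart splitting, run-time estimation, and dynamic noise reduction are not shown):

```python
import random

def noisy_support(true_support, epsilon, sensitivity=1.0, rng=random):
    """Laplace mechanism: perturb an itemset's true support count with
    Laplace(sensitivity/epsilon) noise, drawn here as the difference of
    two exponential variates. Smaller epsilon means stronger privacy and
    therefore more noise."""
    scale = sensitivity / epsilon
    return true_support + rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

rng = random.Random(7)
print(noisy_support(120.0, epsilon=0.5, rng=rng))  # 120 plus Laplace(2) noise
```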
Given a spatio-temporal network, a source, a destination, and a desired departure time interval, the All-
departure-time Lagrangian Shortest Paths (ALSP) problem determines a set which includes the shortest path for
every departure time in the given interval. ALSP is important for critical societal applications such as eco-
routing. However, ALSP is computationally challenging due to the non-stationary ranking of the candidate
paths across distinct departure-times. Current related work for reducing the redundant work, across consecutive
departure-times sharing a common solution, exploits only partial information, e.g., the earliest feasible arrival
time of a path. In contrast, our approach uses all available information, e.g., the entire time series of arrival
times for all departure-times. This allows elimination of all knowable redundant computation based on complete
information available at hand. We operationalize this idea through the concept of critical-time-points (CTP),
i.e., departure-times before which ranking among candidate paths cannot change. In our preliminary work, we
proposed a CTP based forward search strategy. In this paper we propose a CTP based temporal bi-directional
search for the ALSP problem via a novel impromptu rendezvous termination condition. Theoretical and
experimental analysis show that the proposed approach outperforms the related approaches, particularly
when there are few critical-time-points.
ETPL
DM - 040
A Critical-time-point Approach to All-departure-time Lagrangian Shortest
Paths
Numerical typing errors can lead to serious consequences, but various causes of human errors and the lack of
contextual clues in numerical typing make their prediction difficult. Human behavior modeling can predict
the general tendency in making errors, while data mining can recognize neurophysiological feedback in
detecting cognitive abnormality on a trial-by-trial basis. This study suggests integrating human behavior
modeling and data mining to predict human errors because it utilizes both 1) top-down inference to transform
interactions between task characteristics and conditions into a general inclination of an average operator to
make errors and 2) bottom-up analysis in parsing psychophysiological measurements into an individual's
likelihood of making errors on a trial-by-trial basis. Real-time electroencephalograph (EEG) features
collected in a numerical typing experiment and modeling features produced by an enhanced human behavior
model (queuing network model human processor) were combined to improve error classification
performance by a linear discriminant analysis (LDA) classifier. Integrating EEG and modeling features
improved the results of LDA classification by 28.3% in keenness (d') and by 10.7% in the area under ROC
curve (AUC) from that of using EEG only; it also outperformed the other three benchmarking scenarios:
using behaviors only, using apparent task features, and using task features plus trial information. The AUC
was significantly increased from using EEG alone only if EEG + Model features were used.
ETPL
DM - 041 Integrating Human Behavior Modeling and Data Mining Techniques to Predict
Human Errors in Numerical Typing
Intelligently extracting knowledge from social media has recently attracted great interest from the
Biomedical and Health Informatics community to simultaneously improve healthcare outcomes and reduce
costs using consumer-generated opinion. We propose a two-step analysis framework that focuses on positive
and negative sentiment, as well as the side effects of treatment, in users' forum posts, and identifies user
communities (modules) and influential users for the purpose of ascertaining user opinion of cancer treatment.
We used a self-organizing map to analyze word frequency data derived from users' forum posts. We then
introduced a novel network-based approach for modeling users' forum interactions and employed a network
partitioning method based on optimizing a stability quality measure. This allowed us to determine consumer
opinion and identify influential users within the retrieved modules using information derived from both
word-frequency data and network-based properties. Our approach can expand research into intelligently
mining social media data for consumer opinion of various treatments to provide rapid, up-to-date information
for the pharmaceutical industry, hospitals, and medical staff, on the effectiveness (or ineffectiveness) of
future treatments.
ETPL
DM - 042 Network-Based Modeling and Intelligent Data Mining of Social Media for
Improving Care
A novel data mining method was developed to gauge the experience of the drug Sitagliptin (trade name
Januvia) by patients with diabetes mellitus type 2. To this end, we devised a two-step analysis framework.
Initial exploratory analysis using self-organizing maps was performed to determine structures based on user
opinions among the forum posts. The results were a compilation of user clusters and their correlated (positive
or negative) opinions of the drug. Subsequent modeling using network analysis methods was used to determine
influential users among the forum members. These findings can open new avenues of research into rapid data
collection, feedback, and analysis that can enable improved outcomes and solutions for public health and
important feedback for the manufacturer.
ETPL
DM - 043 A Novel Data-Mining Approach Leveraging Social Media to Monitor Consumer
Opinion of Sitagliptin
The High Efficiency Video Coding standard provides improved compression ratio in comparison with its
predecessors at the cost of large increases in the encoding computational complexity. An important share of
this increase is due to the new flexible partitioning structures, namely the coding trees, the prediction units,
and the residual quadtrees, with the best configurations decided through an exhaustive rate-distortion
optimization (RDO) process. In this paper, we propose a set of procedures for deciding whether the partition
structure optimization algorithm should be terminated early or run to the end of an exhaustive search for the
best configuration. The proposed schemes are based on decision trees obtained through data mining
techniques. By extracting intermediate data, such as encoding variables from a training set of video sequences,
three sets of decision trees are built and implemented to avoid running the RDO algorithm to its full extent.
When separately implemented, these schemes achieve average computational complexity reductions (CCRs)
of up to 50% at a negligible cost of 0.56% in terms of Bjontegaard Delta (BD) rate increase. When the
schemes are jointly implemented, an average CCR of up to 65% is achieved, with a small BD-rate increase of
1.36%. Extensive experiments and comparisons with similar works demonstrate that the proposed early
termination schemes achieve the best rate-distortion-complexity tradeoffs among all the compared works.
ETPL
DM - 044
Fast HEVC Encoding Decisions Using Data Mining
Wind energy integration research generally relies on complex sensors located at remote sites. The procedure
for generating high-level synthetic information from databases containing large amounts of low-level data
must therefore account for possible sensor failures and imperfect input data. The data input is highly sensitive
to data quality. To address this problem, this paper presents an empirical methodology that can efficiently
preprocess and filter the raw wind data using only aggregated active power output and the corresponding wind
speed values at the wind farm. First, raw wind data properties are analyzed, and all the data are divided into
six categories according to their attribute magnitudes from a statistical perspective. Next, the weighted
distance, a novel measure of the degree of similarity between individual objects in the wind database, is
incorporated into the local outlier factor (LOF) algorithm to compute the outlier factor of every individual
object, and this outlier factor is then used to assess which category an object belongs to. Finally, the
methodology was tested successfully on the data collected from a large wind farm in northwest China.
ETPL
DM - 045
Raw Wind Data Preprocessing: A Data-Mining Approach
This work deals with the problem of producing a fast and accurate data classification, learning it from a
possibly small set of records that are already classified. The proposed approach is based on the framework of
the so-called Logical Analysis of Data (LAD), but enriched with information obtained from statistical
considerations on the data. A number of discrete optimization problems are solved in the different steps of the
procedure, but their computational demand can be controlled. The accuracy of the proposed approach is
compared to that of the standard LAD algorithm, of Support Vector Machines and of Label Propagation
algorithm on publicly available datasets from the UCI repository. Encouraging results are obtained and discussed.
ETPL
DM - 046
Effective Classification using a small Training Set based on Discretization and
Statistical Analysis
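For illustration only, a first step of LAD-style discretization (the binarization of numeric attributes at class-boundary cutpoints) can be sketched as follows; the statistical enrichment and optimization steps of the proposed approach are not reproduced here, and the sample data is hypothetical:

```python
def candidate_cutpoints(values, labels):
    """Candidate binarization cutpoints for one numeric attribute:
    midpoints between consecutive sorted values whose class labels
    differ (the standard first step of LAD-style discretization)."""
    pairs = sorted(zip(values, labels))
    cuts = []
    for (v1, y1), (v2, y2) in zip(pairs, pairs[1:]):
        if y1 != y2 and v1 != v2:
            cuts.append((v1 + v2) / 2)
    return cuts

def binarize(value, cuts):
    """Map a numeric value to a tuple of binary features (value > cut)."""
    return tuple(int(value > c) for c in cuts)

vals = [1.0, 2.0, 3.0, 10.0, 11.0]
labs = [0, 0, 0, 1, 1]
cuts = candidate_cutpoints(vals, labs)  # one cut between 3.0 and 10.0
```

The resulting binary features are the building blocks from which LAD patterns are formed.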
E-healthcare systems have been increasingly facilitating health condition monitoring, disease modeling and
early intervention, and evidence-based medical treatment through medical text mining and image feature
extraction. Owing to the resource constraints of wearable mobile devices, the frequently collected personal
health information (PHI) must be outsourced to the cloud. Unfortunately, delegating both storage and
computation to an untrusted entity raises a series of security and privacy issues. Existing work has mainly
focused on fine-grained privacy-preserving static medical text access and analysis, which can hardly
accommodate dynamic health-condition fluctuations or medical image analysis. In this paper, a secure and
ETPL
DM - 047
PPDM: Privacy-preserving Protocol for Dynamic Medical Text Mining and Image
Feature Extraction from Secure Data Aggregation in Cloud-assisted e-Healthcare
Systems
Currently, most computer systems use user IDs and passwords as login patterns to authenticate users.
However, many people share their login patterns with coworkers and request that these coworkers assist with
co-tasks, making these patterns one of the weakest points of computer security. Insider attackers, the valid
users of a system who attack it from within, are hard to detect, since most intrusion detection systems and
firewalls identify and isolate only malicious behaviors launched from outside the system. In addition, some
studies have claimed that analyzing the system calls (SCs) generated by commands can identify those
commands and thereby accurately detect attacks, with attack patterns serving as the features of an attack.
Therefore, in this paper, a security system, named the Internal Intrusion Detection and Protection System
(IIDPS), is proposed to detect insider attacks at the SC level by using data mining and forensic techniques.
The IIDPS creates users' personal profiles to keep track of their usage habits as forensic features and
determines whether a valid login user is the account holder by comparing his or her current computer usage
behaviors with the patterns collected in the account holder's personal profile. The experimental results
demonstrate that the IIDPS's user identification accuracy is 94.29% with a response time of less than 0.45 s,
implying that it can protect a system from insider attacks effectively and efficiently.
ETPL
DM - 048
An Internal Intrusion Detection and Protection System by Using Data Mining
and Forensic Techniques
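The profile-matching idea can be sketched as follows; this is a simplified stand-in, not the IIDPS itself, using SC bigram frequencies and cosine similarity as one plausible choice of forensic feature and comparison measure (the SC traces are hypothetical):

```python
from collections import Counter
import math

def profile(syscalls, n=2):
    """Build a user profile as frequencies of system-call n-grams."""
    grams = zip(*(syscalls[i:] for i in range(n)))
    return Counter(grams)

def similarity(p, q):
    """Cosine similarity between two n-gram profiles; a score near 1
    suggests the login user behaves like the account holder."""
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

holder = profile(["open", "read", "close"] * 20)   # account holder habits
same   = profile(["open", "read", "close"] * 5)    # consistent session
other  = profile(["fork", "exec", "socket", "connect"] * 5)  # masquerader
```

A session whose profile similarity falls below a threshold would be flagged as a possible insider attack.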
In this paper, a multiple classifier machine learning (ML) methodology for predictive maintenance (PdM) is
presented. PdM is a prominent strategy for dealing with maintenance issues given the increasing need to
minimize downtime and associated costs. One of the challenges with PdM is generating the so-called “health
factors,” or quantitative indicators, of the status of a system associated with a given maintenance issue, and
determining their relationship to operating costs and failure risk. The proposed PdM methodology allows
dynamic decision rules to be adopted for maintenance management and can be used with high-dimensional
and censored data problems. This is achieved by training multiple classification modules with different
prediction horizons to provide different performance tradeoffs in terms of frequency of unexpected breaks and
unexploited lifetime, and then employing this information in an operating cost-based maintenance decision
system to minimize expected costs. The effectiveness of the methodology is demonstrated using a simulated
example and a benchmark semiconductor manufacturing maintenance problem.
ETPL
DM- 049
Machine Learning for Predictive Maintenance: A Multiple Classifier Approach
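The cost-based decision step can be sketched as follows; the classifier outputs, costs, and horizons are hypothetical placeholders, and the two-action cost model is a deliberate simplification of the paper's decision system:

```python
def decide(p_fail, c_break, c_maint, remaining_value):
    """Choose the cheaper expected action at one decision point:
    keep running (risk an unexpected break) or maintain now
    (pay maintenance plus the unexploited remaining lifetime)."""
    run_on = p_fail * c_break
    maintain = c_maint + remaining_value
    return "maintain" if maintain < run_on else "run"

# Classifiers trained with different prediction horizons would each
# output a failure probability; these values are illustrative.
horizons = {7: 0.05, 14: 0.20, 30: 0.60}
decisions = {h: decide(p, c_break=1000, c_maint=100, remaining_value=50)
             for h, p in horizons.items()}
```

Training one module per horizon yields exactly the frequency-of-breaks versus unexploited-lifetime tradeoff the abstract describes: short horizons favor running on, long horizons favor maintaining.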
Location-based sequential event prediction is an interesting problem with many real-world applications. For
example, knowing when and where people will use certain kinds of services could enable the development of
robust anticipatory systems. A key to this problem is in understanding the nature of the process from which
sequential data arises. Usually, human behavior exhibits distinct spatial, temporal, and social patterns. The
authors examine three kinds of patterns extracted from sequential purchasing events and propose a novel
model that captures contextual dependencies in spatial sequence, customers' temporal preferences, and social
influence via an implicit network. Their model outperforms existing models in evaluations on a real-world
dataset of smartcard transaction records from a large educational institution with 13,753 students over a
10-month period.
ETPL
DM - 050
Predicting Location-Based Sequential Purchasing Events by Using Spatial, Temporal,
and Social Patterns
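A heavily simplified sketch of the spatial component alone (the temporal preferences and the implicit social network of the full model are omitted) is a first-order transition model over locations; the purchase logs are hypothetical:

```python
from collections import defaultdict, Counter

def fit_transitions(sequences):
    """Estimate first-order transition probabilities between locations
    from sequences of purchasing events."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def predict_next(model, current):
    """Most probable next purchase location given the current one."""
    nxt = model.get(current)
    return max(nxt, key=nxt.get) if nxt else None

logs = [["cafe", "library", "cafe", "gym"],
        ["library", "cafe", "gym"],
        ["cafe", "gym", "cafe"]]
model = fit_transitions(logs)
```

The full model conditions this spatial sequence on time of day and on the behavior of socially connected customers.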
There is an unprecedented trend that content providers (CPs) are building their own content delivery networks
(CDNs) to provide a variety of content services to their users. By exploiting powerful CP-level information in
content distribution, these CP-built CDNs open up a whole new design space and are changing the content
delivery landscape. In this paper, we adopt a measurement-based approach to understanding why, how, and
how much CP-level intelligence can help content delivery. We first present a measurement study of the CDN
built by Tencent, one of the largest content providers in China. We observe new characteristics and trends in
content delivery which pose great challenges to the conventional content delivery paradigm and motivate the
proposal of CPCDN, a CDN powered by CP-aware information. We then reveal the benefits obtained by
exploiting two indispensable CP-level intelligences, namely context intelligence and user intelligence, in
content delivery. Inspired by the insights learnt from the measurement studies, we systematically explore the
design space of CPCDN and present the novel architecture and algorithms to address the new content delivery
challenges that have arisen. Our results not only demonstrate the potential of CPCDN in pushing content
delivery performance to the next level, but also identify new research problems calling for further
investigation.
ETPL
DM - 051
CPCDN: Content Delivery Powered by Context and User Intelligence
In this paper, a new method for constructing decision trees for stream data is proposed. First, a new splitting
criterion based on the misclassification error is derived. A theorem is proven showing that, with high
probability, the best attribute computed in the considered node from the available data sample is the same as
the attribute that would be derived from the whole infinite data stream. Next, this result is combined with a
splitting criterion based on the Gini index. It is shown that such a combination provides the highest accuracy
among all studied algorithms.
ETPL
DM - 052
A New Method for Data Stream Mining Based on the Misclassification Error
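The high-probability guarantee described above is typically obtained from a Hoeffding-style bound; a minimal sketch of that split test (with illustrative criterion values, not the paper's derivation) is:

```python
import math

def hoeffding_bound(value_range, delta, n):
    """With probability at least 1 - delta, the observed mean of n
    samples lies within this epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def can_split(best_gain, second_gain, n, value_range=1.0, delta=1e-6):
    """Split the node when the gap between the best and second-best
    attributes' criterion values exceeds the bound, so the same best
    attribute would be chosen on the infinite stream with probability
    at least 1 - delta."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)
```

When the gap is small, the node simply waits for more stream examples before committing to a split attribute.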
This paper first introduces pattern aided regression (PXR) models, a new type of regression models designed to
represent accurate and interpretable prediction models. This was motivated by two observations: (1)
Regression modeling applications often involve complex diverse predictor-response relationships, which occur
when the optimal regression models (of given regression model type) fitting two or more distinct logical
groups of data are highly different. (2) State-of-the-art regression methods are often unable to adequately
model such relationships. This paper defines PXR models using several patterns and local regression models,
which respectively serve as logical and behavioral characterizations of distinct predictor-response
relationships. The paper also introduces a contrast pattern aided regression (CPXR) method, to build accurate
PXR models. In experiments, the PXR models built by CPXR are in general very accurate, often
outperforming state-of-the-art regression methods by large margins. Typically using (a) around seven simple
patterns and (b) linear local regression models, these PXR models are easy to interpret; in fact, their
complexity is only slightly higher than that of (piecewise) linear regression models and significantly lower
than that of traditional ensemble-based regression models. CPXR is especially effective for high-dimensional data.
The paper also discusses how to use CPXR methodology for analyzing prediction models and correcting their
prediction errors.
ETPL
DM - 053
Pattern-Aided Regression Modeling and Prediction Model Analysis
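The core idea of a PXR model (patterns as logical characterizations, local regressions as behavioral ones) can be sketched on a toy one-dimensional example; the two patterns and their data are hypothetical, and pattern mining itself is not shown:

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b in one dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def pxr_predict(x, pattern):
    """Predict with the local model of the pattern that x satisfies."""
    a, b = local_models[pattern]
    return a * x + b

# Two logical groups with very different x -> y relationships: one
# global line fits poorly, but one local model per pattern is exact.
data = {"low":  ([1, 2, 3], [2, 4, 6]),       # behaves as y = 2x
        "high": ([10, 11, 12], [5, 4, 3])}    # behaves as y = -x + 15
local_models = {p: fit_line(xs, ys) for p, (xs, ys) in data.items()}
```

This is exactly the "diverse predictor-response relationship" situation the abstract describes: each pattern selects the data it logically matches, and its local model captures the behavior there.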
In collaborative environments, members may try to acquire similar information on the Web in order to gain
knowledge in one domain. For example, in a company several departments may successively need to buy
business intelligence software and employees from these departments may have studied online about different
business intelligence tools and their features independently. It will be productive to get them connected and
share learned knowledge. We investigate fine-grained knowledge sharing in collaborative environments. We
propose to analyze members’ Web surfing data to summarize the fine-grained knowledge acquired by them. A
two-step framework is proposed for mining fine-grained knowledge: (1) Web surfing data is clustered into
tasks by a nonparametric generative model; (2) a novel discriminative infinite Hidden Markov Model is
developed to mine fine-grained aspects in each task. Finally, the classic expert search method is applied to the
mined results to find suitable members for knowledge sharing. Experiments on Web surfing data collected
from our lab at UCSB and IBM show that the fine-grained aspect mining framework works as expected and
outperforms baselines. When it is integrated with expert search, the search accuracy improves significantly
compared with applying the classic expert search method directly to the Web surfing data.
ETPL
DM - 054
Fine-Grained Knowledge Sharing in Collaborative Environments
High utility sequential pattern mining has been considered as an important research problem and a number of
relevant algorithms have been proposed for this topic. The main challenge of high utility sequential pattern
mining is that the search space is large, and the efficiency of a solution is directly affected by the degree to
which it can eliminate candidate patterns. Therefore, the efficiency of any high utility sequential pattern
mining solution depends on its ability to reduce this large search space and thereby lower the computational
complexity of calculating the utilities of the candidate patterns. In this paper, we propose efficient data
structures and a pruning technique based on a Cumulated Rest of Match (CRoM) upper bound. By defining a
tighter upper bound on the utility of the candidates, CRoM allows more conservative pruning
before candidate pattern generation in comparison to the existing techniques. In addition, we have developed
an efficient algorithm, HuspExt (High Utility Sequential Pattern Extraction), which calculates the utilities of
the child patterns from those of their parents. Substantial experiments on both synthetic and real datasets
from different domains show that the proposed solution efficiently discovers high utility sequential patterns
from large-scale datasets with different data characteristics under low utility thresholds.
ETPL
DM - 055
CRoM and HuspExt: Improving Efficiency of High Utility Sequential Pattern
Extraction
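The general shape of such upper-bound pruning (not the exact CRoM definition, whose details are in the paper) can be sketched as follows, with hypothetical utility values:

```python
def utility_upper_bound(prefix_utility, rest_utilities):
    """Optimistic bound in the spirit of rest-of-match pruning: no
    extension of a candidate can exceed the utility already earned by
    the prefix plus the summed utility of the items that may still
    follow the match in the sequences."""
    return prefix_utility + sum(rest_utilities)

def prune(prefix_utility, rest_utilities, min_utility):
    """Prune the candidate's whole subtree when even the optimistic
    bound cannot reach the utility threshold."""
    return utility_upper_bound(prefix_utility, rest_utilities) < min_utility
```

The tighter the bound, the earlier whole families of candidate patterns are discarded before generation, which is the efficiency lever the abstract describes.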
This paper addresses the problem of keyword extraction from conversations, with the goal of using these
keywords to retrieve, for each short conversation fragment, a small number of potentially relevant documents,
which can be recommended to participants. However, even a short fragment contains a variety of words,
which are potentially related to several topics; moreover, using an automatic speech recognition (ASR) system
introduces errors among them. Therefore, it is difficult to infer precisely the information needs of the
conversation participants. We first propose an algorithm to extract keywords from the output of an ASR
system (or a manual transcript for testing), which makes use of topic modeling techniques and of a submodular
reward function which favors diversity in the keyword set, to match the potential diversity of topics and reduce
ASR noise. Then, we propose a method to derive multiple topically separated queries from this keyword set, in
order to maximize the chances of making at least one relevant recommendation when using these queries to
search over the English Wikipedia. The proposed methods are evaluated in terms of relevance with respect to
conversation fragments from the Fisher, AMI, and ELEA conversational corpora, rated by several human
judges. The scores show that our proposal improves over previous methods that consider only word frequency
or topic similarity, and represents a promising solution for a document recommender system to be used in
conversations.
ETPL
DM - 056
Keyword Extraction and Clustering for Document Recommendation in
Conversations
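The diversity-favoring submodular selection can be illustrated with a greedy sketch; the concave square-root coverage used here is one common submodular choice, not necessarily the paper's exact reward, and the per-word topic weights are hypothetical:

```python
import math

def diversity_reward(selected, word_topics):
    """Submodular reward: sum over topics of sqrt(coverage). The
    concave sqrt gives diminishing returns, so a keyword from an
    untouched topic gains more than one piling onto a covered topic."""
    coverage = {}
    for w in selected:
        for t, wt in word_topics[w].items():
            coverage[t] = coverage.get(t, 0.0) + wt
    return sum(math.sqrt(c) for c in coverage.values())

def greedy_keywords(word_topics, k):
    """Greedily pick k keywords maximizing the marginal reward."""
    selected = []
    while len(selected) < k:
        best = max((w for w in word_topics if w not in selected),
                   key=lambda w: diversity_reward(selected + [w],
                                                  word_topics))
        selected.append(best)
    return selected

# Hypothetical per-word topic weights from a topic model.
word_topics = {"goal":  {"sports": 1.0},
               "match": {"sports": 0.9},
               "vote":  {"politics": 1.0},
               "bill":  {"politics": 0.8}}
keys = greedy_keywords(word_topics, 2)
```

The greedy solution spreads the keyword set across topics rather than picking two near-synonyms from one topic, which is what makes the derived queries topically separated.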
Thank You!