Elysium Technologies Private...


Elysium Technologies Private Limited | Singapore | Madurai | Chennai | Trichy | Ramnad | Erode | Tirunelveli | Sivakasi | Dindugul

http://www.elysiumtechnologies.com, [email protected]


Mining high utility itemsets (HUIs) from databases is an important data mining task, which refers to the

discovery of itemsets with high utilities (e.g. high profits). However, it may present too many HUIs to users,

which also degrades the efficiency of the mining process. To achieve high efficiency for the mining task and

provide a concise mining result to users, we propose a novel framework in this paper for mining closed+ high

utility itemsets (CHUIs), which serves as a compact and lossless representation of HUIs. We propose three

efficient algorithms named AprioriHC (Apriori-based algorithm for mining High utility Closed+ itemsets),

AprioriHC-D (AprioriHC algorithm with Discarding unpromising and isolated items) and CHUD (Closed+

High Utility Itemset Discovery) to find this representation. Further, a method called DAHU (Derive All High

Utility Itemsets) is proposed to recover all HUIs from the set of CHUIs without accessing the original

database. Results on real and synthetic datasets show that the proposed algorithms are very efficient and that

our approaches achieve a massive reduction in the number of HUIs. In addition, when all HUIs can be

recovered by DAHU, the combination of CHUD and DAHU outperforms the state-of-the-art algorithms for

mining HUIs.

ETPL

DM - 001

Efficient Algorithms for Mining the Concise and Lossless Representation of

High Utility Itemsets
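
As a quick illustration of the utility notion used in the entry above (toy data only; the AprioriHC and CHUD algorithms themselves are not reproduced), the sketch below computes the utility of an itemset over a small transaction database and keeps the itemsets whose utility reaches a minimum-utility threshold.

```python
# Minimal sketch of the high-utility-itemset notion (toy data, not the paper's algorithms).
from itertools import combinations

# Each transaction maps item -> purchased quantity; profit maps item -> unit profit.
transactions = [
    {"a": 2, "b": 1, "d": 3},
    {"a": 1, "c": 4},
    {"b": 2, "c": 1, "d": 1},
]
profit = {"a": 5, "b": 2, "c": 1, "d": 4}

def itemset_utility(itemset, db, unit_profit):
    """Sum of quantity * unit profit over all transactions containing the itemset."""
    total = 0
    for t in db:
        if all(i in t for i in itemset):
            total += sum(t[i] * unit_profit[i] for i in itemset)
    return total

min_util = 20
items = sorted(profit)
huis = [
    (set(c), itemset_utility(c, transactions, profit))
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if itemset_utility(c, transactions, profit) >= min_util
]
print(huis)  # itemsets whose utility reaches the threshold
```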

In recent years, some authors have approached the instance selection problem from a meta-learning

perspective. In their work, they try to find relationships between the performance of some methods from this

field and the values of some data-complexity measures, with the aim of determining the best performing

method given a data set, using only the values of the measures computed on this data. Nevertheless, most of

the data-complexity measures existing in the literature were not conceived for this purpose and the feasibility

of their use in this field is yet to be determined. In this paper, we revise the definition of some measures that

we presented in a previous work, that were designed for meta-learning based instance selection. Also, we

assess them in an experimental study involving three sets of measures, 59 databases, 16 instance selection

methods, two classifiers, and eight regression learners used as meta-learners. The results suggest that our

measures are more efficient and effective than those traditionally used by researchers that have addressed the

instance selection from a perspective based on meta-learning.


ETPL

DM - 002

A Set of Complexity Measures Designed for Applying Meta-Learning to

Instance Selection
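
By way of illustration only (the abstract above does not list the specific measures proposed in the paper), the sketch below computes one classical data-complexity measure, Fisher's discriminant ratio, of the kind a meta-learner could take as an input feature when predicting which instance selection method will perform best.

```python
# Illustrative data-complexity measure (Fisher's discriminant ratio) for a
# two-class dataset; a meta-learner could take such measures as input features.
import numpy as np

def fisher_discriminant_ratio(X, y):
    """Max over features of (mu1 - mu2)^2 / (var1 + var2) for a binary-labelled X."""
    classes = np.unique(y)
    assert len(classes) == 2, "this toy measure assumes two classes"
    X1, X2 = X[y == classes[0]], X[y == classes[1]]
    num = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    den = X1.var(axis=0) + X2.var(axis=0) + 1e-12
    return float(np.max(num / den))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_discriminant_ratio(X, y))  # larger values indicate easier class separation
```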


Data Mining has wide applications in many areas such as banking, medicine, scientific research and among

government agencies. Classification is one of the commonly used tasks in data mining applications. For the

past decade, due to the rise of various privacy issues, many theoretical and practical solutions to the

classification problem have been proposed under different security models. However, with the recent

popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted form, as

well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-

preserving classification techniques are not applicable. In this paper, we focus on solving the classification

problem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the

cloud. The proposed protocol protects the confidentiality of data, privacy of user's input query, and hides the

data access patterns. To the best of our knowledge, our work is the first to develop a secure k-NN classifier

over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed

protocol using a real-world dataset under different parameter settings.

ETPL

DM - 003

k-Nearest Neighbor Classification over Semantically Secure Encrypted

Relational Data
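
For context on the entry above, the sketch below shows only the plaintext k-NN classification step; the paper's contribution is the secure protocol that evaluates this over encrypted data, which involves cryptographic sub-protocols not reproduced here.

```python
# Plaintext k-NN baseline only; the secure protocol over encrypted data is not shown.
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Return the majority label among the k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y_train = np.array(["low", "low", "high", "high"])
print(knn_predict(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "low"
```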

As a newly emerging network model, heterogeneous information networks (HINs) have received growing

attention. Many data mining tasks have been explored in HINs, including clustering, classification, and

similarity search. Similarity join is a fundamental operation required for many problems. It is attracting

attention from various applications on network data, such as friend recommendation, link prediction, and

online advertising. Although similarity join has been well studied in homogeneous networks, it has not yet

been studied in heterogeneous networks. Especially, none of the existing research on similarity join takes

different semantic meanings behind paths into consideration and almost all completely ignore the

heterogeneity and diversity of the HINs. In this paper, we propose a path-based similarity join (PS-join)

method to return the top k similar pairs of objects based on any user specified join path in a heterogeneous

information network. We study how to prune expensive similarity computation by introducing bucket pruning

based locality sensitive hashing (BPLSH) indexing. Compared with the existing Link-based Similarity join (LS-

join) method, PS-join can derive various similarity semantics. Experimental results on real data sets show the

efficiency and effectiveness of the proposed approach.

ETPL

DM - 004

Top-k Similarity Join in Heterogeneous Information Networks
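
The sketch below illustrates the general bucket-pruning idea behind such a similarity join using generic random-hyperplane LSH; it is a stand-in for the paper's BPLSH index, whose exact construction and path-based similarity measures are not reproduced here.

```python
# Generic random-hyperplane LSH bucketing to prune candidate pairs in a similarity
# join; only pairs that share a bucket are scored exactly.
import numpy as np
from collections import defaultdict
from itertools import combinations

rng = np.random.default_rng(42)
vectors = rng.normal(size=(200, 16))          # e.g., path-based feature vectors (toy data)
planes = rng.normal(size=(8, 16))             # 8 random hyperplanes -> 8-bit signatures

signatures = (vectors @ planes.T > 0).astype(int)
buckets = defaultdict(list)
for idx, sig in enumerate(signatures):
    buckets[tuple(sig)].append(idx)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

candidates = [(i, j) for bucket in buckets.values() for i, j in combinations(bucket, 2)]
top_k = sorted(candidates, key=lambda p: cosine(vectors[p[0]], vectors[p[1]]), reverse=True)[:10]
print(len(candidates), "candidate pairs instead of", 200 * 199 // 2)
print(top_k[:3])
```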


Numerous theories and algorithms have been developed to solve vectorial data learning problems by

searching for the hypothesis that best fits the observed training sample. However, many real-world

applications involve samples that are not described as feature vectors, but as (dis)similarity data.

Converting vectorial data into (dis)similarity data is more easily performed than converting

(dis)similarity data into vectorial data. This study proposes a stochastic iterative distance

transformation model for similarity-based learning. The proposed model can be used to identify a

clear class boundary in data by modifying the (dis)similarities between examples. The experimental

results indicate that the performance of the proposed method is comparable with those of various

vector-based and proximity-based learning algorithms.

ETPL

DM - 005

A Similarity-Based Learning Algorithm Using Distance Transformation

Many mature term-based or pattern-based approaches have been used in the field of information filtering to

generate users' information needs from a collection of documents. A fundamental assumption for these

approaches is that the documents in the collection are all about one topic. However, in reality users' interests

can be diverse and the documents in the collection often involve multiple topics. Topic modelling, such as

Latent Dirichlet Allocation (LDA), was proposed to generate statistical models to represent multiple topics in

a collection of documents, and this has been widely utilized in the fields of machine learning and information

retrieval, etc. But its effectiveness in information filtering has not been so well explored. Patterns are always

thought to be more discriminative than single terms for describing documents. However, the enormous amount

of discovered patterns hinder them from being effectively and efficiently used in real applications, therefore,

selection of the most discriminative and representative patterns from the huge amount of discovered patterns

becomes crucial. To deal with the above mentioned limitations and problems, in this paper, a novel

information filtering model, Maximum matched Pattern-based Topic Model (MPBTM), is proposed. The main

distinctive features of the proposed model include: (1) user information needs are generated in terms of

multiple topics; (2) each topic is represented by patterns; (3) patterns are generated from topic models and are

organized in terms of their statistical and taxonomic features; and (4) the most discriminative and

representative patterns, called Maximum Matched Patterns, are proposed to estimate the document relevance

to the user's information needs in order to filter out irrelevant documents. Extensive experiments are conducted

to evaluate the effectiveness of the proposed model by using the TREC data collection Reuters Corpus

Volume 1. The results show that the proposed model significantly outperforms both state-of-the-art term-based and pattern-based models.

ETPL

DM - 006

Pattern-based Topics for Document Modelling in Information Filtering
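
Only the topic-modelling step that the proposed model builds on is sketched below, using scikit-learn's LDA; the pattern organisation and maximum-matched-pattern scoring described in the entry above are not shown.

```python
# LDA topic modelling with scikit-learn: per-topic term distributions and
# per-document topic mixtures (the pattern-matching machinery is not reproduced).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stock market trading shares prices",
    "football match goals league season",
    "market prices inflation shares economy",
    "league season players football transfer",
]
counts = CountVectorizer().fit(docs)
X = counts.transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = counts.get_feature_names_out()
for t, comp in enumerate(lda.components_):
    top = [terms[i] for i in comp.argsort()[-4:][::-1]]
    print(f"topic {t}:", top)          # most probable terms per topic
print(lda.transform(X))                # per-document topic mixtures
```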


Learning to rank arises in many data mining applications, ranging from web search engine, online advertising

to recommendation system. In learning to rank, the performance of a ranking model is strongly affected by the

number of labeled examples in the training set; on the other hand, obtaining labeled examples for training data

is very expensive and time-consuming. This presents a great need for the active learning approaches to select

most informative examples for ranking learning; however, in the literature there is still very limited work to

address active learning for ranking. In this paper, we propose a general active learning framework, expected

loss optimization (ELO), for ranking. The ELO framework is applicable to a wide range of ranking functions.

Under this framework, we derive a novel algorithm, expected discounted cumulative gain (DCG) loss

optimization (ELO-DCG), to select most informative examples. Then, we investigate both query and

document level active learning for ranking and propose a two-stage ELO-DCG algorithm which incorporates

both query and document selection into active learning. Furthermore, we show that it is flexible for the

algorithm to deal with the skewed grade distribution problem with the modification of the loss function.

Extensive experiments on real-world web search data sets have demonstrated great potential and effectiveness

of the proposed framework and algorithms.

ETPL

DM - 007

Active Learning for Ranking through Expected Loss Optimization
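
The sketch below shows DCG and a crude "expected DCG loss" selection heuristic obtained by sampling perturbed score vectors; it is only an intuition aid for the entry above, not the paper's ELO-DCG derivation.

```python
# DCG plus a sampling-based proxy for expected DCG loss: queries whose rankings are
# unstable under score noise are the ones an active learner would label next.
import numpy as np

def dcg(relevances):
    relevances = np.asarray(relevances, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(relevances) + 2))
    return float((relevances * discounts).sum())

def expected_dcg_loss(scores, relevances, noise=0.3, samples=200, seed=0):
    """Average DCG gap between the ideal ranking and rankings from noisy scores."""
    rng = np.random.default_rng(seed)
    best = dcg(sorted(relevances, reverse=True))
    losses = []
    for _ in range(samples):
        noisy = scores + rng.normal(0, noise, size=len(scores))
        order = np.argsort(-noisy)
        losses.append(best - dcg(np.asarray(relevances)[order]))
    return float(np.mean(losses))

# A confidently-ordered query vs. one with nearly tied scores:
print(expected_dcg_loss(np.array([3.0, 1.0, 0.2]), [3, 1, 0]))
print(expected_dcg_loss(np.array([1.1, 1.0, 0.9]), [3, 1, 0]))  # higher loss -> label next
```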

Due to its successful application in recommender systems, collaborative filtering (CF) has become a hot

research topic in data mining and information retrieval. In traditional CF methods, only the feedback matrix,

which contains either explicit feedback (also called ratings) or implicit feedback on the items given by users, is

used for training and prediction. Typically, the feedback matrix is sparse, which means that most users interact

with few items. Due to this sparsity problem, traditional CF with only feedback information will suffer from

unsatisfactory performance. Recently, many researchers have proposed to utilize auxiliary information, such

as item content (attributes), to alleviate the data sparsity problem in CF. Collaborative topic regression (CTR)

is one of these methods which has achieved promising performance by successfully integrating both feedback

information and item content information. In many real applications, besides the feedback and item content

information, there may exist relations (also known as networks) among the items which can be helpful for

recommendation. In this paper, we develop a novel hierarchical Bayesian model called Relational

Collaborative Topic Regression (RCTR), which extends CTR by seamlessly integrating the user-item

feedback information, item content information, and network structure among items into the same model.

Experiments on real-world datasets show that our model can achieve better prediction accuracy than the state-of-the-art methods.

ETPL

DM - 008

Relational Collaborative Topic Regression for Recommender Systems
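
The sketch below shows only the bare matrix-factorization backbone that CTR and RCTR extend; the item content and item-network terms described in the entry above are omitted, and the toy ratings are hypothetical.

```python
# PMF-style matrix factorization trained with SGD on a few toy (user, item, rating) triples.
import numpy as np

ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 0, 1.0), (2, 2, 5.0)]
n_users, n_items, dim = 3, 3, 4
rng = np.random.default_rng(0)
U = 0.1 * rng.normal(size=(n_users, dim))
V = 0.1 * rng.normal(size=(n_items, dim))

lr, reg = 0.05, 0.02
for epoch in range(200):
    for u, i, r in ratings:
        err = r - U[u] @ V[i]
        U[u] += lr * (err * V[i] - reg * U[u])   # gradient step on user factors
        V[i] += lr * (err * U[u] - reg * V[i])   # gradient step on item factors

print(np.round(U @ V.T, 2))   # reconstructed rating matrix (observed cells fit closely)
```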


It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing

user preferences because of large scale terms and data patterns. Most existing popular text mining and

classification methods have adopted term-based approaches. However, they have all suffered from the

problems of polysemy and synonymy. Over the years, the hypothesis has often been held that pattern-

based methods should perform better than term-based ones in describing user preferences; yet, how to

effectively use large scale patterns remains a hard problem in text mining. To make a breakthrough in this

challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both

positive and negative patterns in text documents as higher level features and deploys them over low-level

features (terms). It also classifies terms into categories and updates term weights based on their specificity and

their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics and Reuters-

21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods

and the pattern-based methods.

ETPL

DM - 009

Relevance Feature Discovery for Text Mining

Given the proliferation of review content, and the fact that reviews are highly diverse and often unnecessarily

verbose, users frequently face the problem of selecting the appropriate reviews to consume. Micro-reviews are

emerging as a new type of online review content in the social media. Micro-reviews are posted by users of

check-in services such as Foursquare. They are concise (up to 200 characters long) and highly focused, in

contrast to the comprehensive and verbose reviews. In this paper, we propose a novel mining problem, which

brings together these two disparate sources of review content. Specifically, we use coverage of micro-reviews

as an objective for selecting a set of reviews that cover efficiently the salient aspects of an entity. Our

approach consists of a two-step process: matching review sentences to micro-reviews, and selecting a small set

of reviews that cover as many micro-reviews as possible, with few sentences. We formulate this objective as a

combinatorial optimization problem, and show how to derive an optimal solution using Integer Linear

Programming. We also propose an efficient heuristic algorithm that approximates the optimal solution.

Finally, we perform a detailed evaluation of all the steps of our methodology using data collected from

Foursquare and Yelp.

ETPL

DM - 010

Review Selection Using Micro-Reviews
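
A greedy stand-in for the coverage objective in the entry above is sketched below: each candidate review is represented by the set of micro-review ids its sentences match (the matching step is assumed already done), and reviews are picked to maximise newly covered micro-reviews. The paper's ILP formulation is not reproduced.

```python
# Greedy coverage heuristic over hypothetical review -> covered-micro-review sets.
def select_reviews(review_coverage, budget):
    """Pick up to `budget` reviews maximising the number of covered micro-reviews."""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(review_coverage, key=lambda r: len(review_coverage[r] - covered), default=None)
        if best is None or not (review_coverage[best] - covered):
            break
        chosen.append(best)
        covered |= review_coverage[best]
        review_coverage = {r: s for r, s in review_coverage.items() if r != best}
    return chosen, covered

coverage = {
    "review_A": {1, 2, 3},
    "review_B": {3, 4},
    "review_C": {5},
    "review_D": {1, 4, 5},
}
print(select_reviews(coverage, budget=2))  # -> (['review_A', 'review_D'], {1, 2, 3, 4, 5})
```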


Sequential rule mining approaches that require items to appear in a strict order suffer from three problems: (1) similar rules can be rated quite differently, (2) rules may not be found because they are

individually considered uninteresting, and (3) rules that are too specific are less likely to be used for making

predictions. To address these issues, we explore the idea of mining “partially-ordered sequential rules”

(POSR), a more general form of sequential rules such that items in the antecedent and the consequent of each

rule are unordered. To mine POSR, we propose the RuleGrowth algorithm, which is efficient and easily

extendable. In particular, we present an extension (TRuleGrowth) that accepts a sliding-window constraint to

find rules occurring within a maximum amount of time. A performance study with four real-life datasets shows

that RuleGrowth and TRuleGrowth have excellent performance and scalability compared to baseline

algorithms and that the number of rules discovered can be several orders of magnitude smaller when the

sliding-window constraint is applied. Furthermore, we also report results from a real application showing that

POSR can provide a much higher prediction accuracy than regular sequential rules for sequence prediction.

ETPL

DM - 011

Mining Partially-Ordered Sequential Rules Common to Multiple Sequences
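
To make the rule semantics in the entry above concrete, the sketch below counts support and confidence of a partially-ordered rule X => Y, where all items of X must occur (in any order) before all items of Y in a sequence. This is only the measure, not the RuleGrowth search itself.

```python
# Support/confidence of a partially-ordered sequential rule over toy sequences.
def rule_holds(sequence, antecedent, consequent):
    last_pos = {item: i for i, item in enumerate(sequence)}   # last occurrence per item
    first_pos = {}
    for i, item in enumerate(sequence):
        first_pos.setdefault(item, i)                         # first occurrence per item
    if not all(a in first_pos for a in antecedent):
        return False
    if not all(c in last_pos for c in consequent):
        return False
    last_antecedent = max(first_pos[a] for a in antecedent)
    return all(last_pos[c] > last_antecedent for c in consequent)

def support_confidence(sequences, antecedent, consequent):
    holds = sum(rule_holds(s, antecedent, consequent) for s in sequences)
    has_antecedent = sum(all(a in s for a in antecedent) for s in sequences)
    support = holds / len(sequences)
    confidence = holds / has_antecedent if has_antecedent else 0.0
    return support, confidence

sequences = [list("abcef"), list("bacd"), list("acbe"), list("abd")]
print(support_confidence(sequences, antecedent={"a", "b"}, consequent={"e"}))
```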

The problem of mobile sequential recommendation is to suggest a route connecting a set of pick-up points for

a taxi driver so that he/she is more likely to get passengers with less travel cost. Essentially, a key challenge of

this problem is its high computational complexity. In this paper, we propose a novel dynamic programming

based method to solve the mobile sequential recommendation problem consisting of two separate stages: an

offline pre-processing stage and an online search stage. The offline stage pre-computes potential candidate

sequences from a set of pick-up points. A backward incremental sequence generation algorithm is proposed

based on the identified iterative property of the cost function. Simultaneously, an incremental pruning policy is

adopted in the process of sequence generation to reduce the search space of the potential sequences

effectively. In addition, a batch pruning algorithm is further applied to the generated potential sequences to

remove some non-optimal sequences of a given length. Since the pruning effectiveness keeps growing with the

increase of the sequence length, at the online stage, our method can efficiently find the optimal driving route

for an unloaded taxi in the remaining candidate sequences. Moreover, our method can handle the problem of

optimal route search with a maximum cruising distance or a destination constraint. Experimental results on

real and synthetic data sets show that both the pruning ability and the efficiency of our method surpass the

state-of-the-art methods. Our techniques can therefore be effectively employed to address the problem of

mobile sequential recommendation with many pick-up points in real-world applications.

ETPL

DM - 012

Backward Path Growth for Efficient Mobile Sequential Recommendation


Mining opinion targets and opinion words from online reviews are important tasks for fine-grained opinion

mining, the key component of which involves detecting opinion relations among words. To this end, this paper

proposes a novel approach based on the partially-supervised alignment model, which regards identifying

opinion relations as an alignment process. Then, a graph-based co-ranking algorithm is exploited to estimate

the confidence of each candidate. Finally, candidates with higher confidence are extracted as opinion targets or

opinion words. Compared to previous methods based on the nearest-neighbor rules, our model captures

opinion relations more precisely, especially for long-span relations. Compared to syntax-based methods, our

word alignment model effectively alleviates the negative effects of parsing errors when dealing with informal

online texts. In particular, compared to the traditional unsupervised alignment model, the proposed model

obtains better precision because of the usage of partial supervision. In addition, when estimating candidate

confidence, we penalize higher-degree vertices in our graph-based co-ranking algorithm to decrease the

probability of error generation. Our experimental results on three corpora with different sizes and languages

show that our approach effectively outperforms state-of-the-art methods.

ETPL

DM - 013

Co-Extracting Opinion Targets and Opinion Words from Online Reviews Based

on the Word Alignment Model

Feature selection has been an important research topic in data mining, because the real data sets often have

high-dimensional features, such as in bioinformatics and text mining applications. Many existing filter feature

selection methods rank features by optimizing certain feature ranking criteria, such that correlated features

often have similar rankings. These correlated features are redundant and don’t provide large mutual

information to help data mining. Thus, when we select a limited number of features, we hope to select the top

non-redundant features such that the useful mutual information can be maximized. In previous research, Ding

et al. recognized this important issue and proposed the mRMR (minimum Redundancy Maximum Relevance

Feature Selection) model to minimize the redundancy between sequentially selected features. However, this

method used the greedy search, thus the global feature redundancy wasn’t considered and the results are not

optimal. In this paper, we propose a new feature selection framework to globally minimize the feature

redundancy while maximizing the given feature ranking scores, which can come from any supervised or

unsupervised methods. Our new model has no parameters, so it is especially suitable for practical data

mining applications. Experimental results on benchmark data sets show that the proposed method consistently

improves the feature selection results compared to the original methods. Meanwhile, we introduce a new

unsupervised global and local discriminative feature selection method which can be unified with the global

feature redundancy minimization framework and shows superior performance.

ETPL

DM - 014

Global Redundancy Minimization for Feature Ranking
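
For contrast with the global framework described above, the sketch below implements the greedy mRMR-style baseline the entry refers to: relevance comes from mutual information with the label and redundancy is approximated by absolute feature correlation. The paper's global redundancy-minimisation optimisation is not reproduced.

```python
# Greedy relevance-minus-redundancy feature selection (mRMR-style baseline).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)
relevance = mutual_info_classif(X, y, random_state=0)
corr = np.abs(np.corrcoef(X, rowvar=False))

selected = [int(np.argmax(relevance))]
while len(selected) < 5:
    scores = []
    for f in range(X.shape[1]):
        if f in selected:
            scores.append(-np.inf)
            continue
        redundancy = np.mean([corr[f, s] for s in selected])
        scores.append(relevance[f] - redundancy)      # relevance minus average redundancy
    selected.append(int(np.argmax(scores)))

print("selected feature indices:", selected)
```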


There is an intense technological race underway to build the highest-performance and lowest-power custom

Bitcoin mining appliances using custom ASIC processors. This article describes the architecture and

implementation details of CoinTerra's first-generation Bitcoin mining processor, Goldstrike 1, and how this

processor is used to design a complete Bitcoin mining machine called Terraminer IV. Because of high power

density in the Bitcoin mining processor, delivering power and cooling the die posed enormous challenges.

This article describes some of the solutions adopted to overcome these challenges.

ETPL

DM - 015

Goldstrike 1: CoinTerra's First-Generation Cryptocurrency Mining Processor

for Bitcoin

In this study, the performance of two waterline extraction approaches is analyzed using dual-polarization

Cosmo-SkyMed (CSK) Synthetic Aperture Radar (SAR) data and ancillary ground truth information. The

single-polarization approach is based on multiscale normalized cuts segmentation; while, the dual-polarization

one exploits the inherent peculiarities of the CSK PING PONG incoherent dual-polarimetric imaging mode

together with a tailored scattering model to perform land/sea discrimination. The two approaches are applied

to the actual CSK SAR data collected over the coastal area of Shanghai, China. To provide a detailed and

complete validation of the two approaches, we carried out several field surveys collecting in situ ancillary

information including Global Positioning System (GPS) data and tidal information. Experimental results show

that 1) both approaches provide satisfactory results in extracting waterline from CSK SAR data in the

intertidal flat under low-to-moderate wind conditions and under a very broad range of incidence angles; 2) the

accuracy of the waterline extracted by both approaches decreases in case of water within the intertidal flat; 3)

the single-polarization approach is unsupervised when the land/sea contrast ratio is high. However, it needs

manual supervision to correct the extracted waterline when the land/sea contrast is low or in complex areas. A

typical CSK scene is processed in about 25 min; 4) the dual-polarization approach is unsupervised and very

effective: a typical CSK SAR scene is processed in seconds.

ETPL

DM - 016

Performance Analysis and Validation of Waterline Extraction Approaches

Using Single- and Dual-Polarimetric SAR Data


Current tools that facilitate the extract-transform-load (ETL) process focus on ETL workflow, not on

generating meaningful semantic relationships to integrate data from multiple, heterogeneous sources. A

proposed semantic ETL framework applies semantics to various data fields and so allows richer data

integration.

ETPL

DM - 017

Integrating Big Data: A Semantic Extract-Transform-Load Framework

A description of patient conditions should consist of the changes in and combination of clinical

measures. Traditional data-processing method and classification algorithms might cause clinical

information to disappear and reduce prediction performance. To improve the accuracy of clinical-

outcome prediction by using multiple measurements, a new multiple-time-series data-processing

algorithm with period merging is proposed. Clinical data from 83 hepatocellular carcinoma (HCC)

patients were used in this research. Their clinical reports from a defined period were merged using the

proposed merging algorithm, and statistical measures were also calculated. After data processing,

multiple measurements support vector machine (MMSVM) with radial basis function (RBF) kernels

was used as a classification method to predict HCC recurrence. A multiple measurements random

forest regression (MMRF) was also used as an additional evaluation/classification method. To

evaluate the data-merging algorithm, the performance of prediction using processed multiple

measurements was compared to prediction using single measurements. The results of recurrence

prediction by MMSVM with RBF using multiple measurements and a period of 120 days (accuracy

0.771, balanced accuracy 0.603) were optimal, and their superiority to the results obtained using single

measurements was statistically significant (accuracy 0.626, balanced accuracy 0.459, P < 0.01). In the

cases of MMRF, the prediction results obtained after applying the proposed merging algorithm were

also better than single-measurement results (P < 0.05). The results show that the performance of

HCC-recurrence prediction was significantly improved when the proposed data-processing algorithm

was used, and that multiple measurements could be of greater value than single measurements.

ETPL

DM - 018

Multiple-Time-Series Clinical Data Processing for Classification With Merging

Algorithm and Statistical Measures
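
The sketch below conveys only the general idea of the entry above: measurements falling within the same merge period are collapsed into simple statistical features and fed to an RBF-kernel SVM. The paper's exact merging rules, feature set, and evaluation protocol are not reproduced, and the patient data here are hypothetical.

```python
# Period merging of repeated clinical measurements followed by an RBF-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def merge_period(visits, period_days=120):
    """visits: list of (day, value) measurements; keep the most recent merge period."""
    latest = max(day for day, _ in visits)
    window = [v for day, v in visits if latest - day <= period_days]
    return [np.mean(window), np.min(window), np.max(window), window[-1]]

# Toy patients: lab values over time plus a recurrence label.
patients = [
    ([(0, 12.0), (60, 14.0), (150, 30.0)], 1),
    ([(0, 10.0), (90, 11.0), (140, 10.5)], 0),
    ([(0, 15.0), (70, 22.0), (130, 28.0)], 1),
    ([(0, 9.0), (80, 9.5), (160, 10.0)], 0),
]
X = np.array([merge_period(v) for v, _ in patients])
y = np.array([label for _, label in patients])

clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
print(clf.predict(X))   # in-sample predictions on the toy data
```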


The increasing complexity and volume of data mandate tighter integration between analytics and visualization.

In this paper, I propose a pattern for integrating computational and visual analytics techniques, called just-in-

time (JIT) interactive analytics. JIT analytics is performed in real-time on data that users are interacting with to

guide visual-analytic exploration. Fundamental to JIT analytics is enriching visualizations with annotations

that describe semantics of visual features, thereby suggesting to users possible insights to examine further. To

accomplish this, JIT analytics needs to 1) identify insights depicted as visual patterns such as clusters, outliers,

and trends in visualizations and 2) determine the semantics of such features by considering not only attributes

that are being visualized but also other attributes in data. In this paper, I describe the JIT interactive analytics

pattern, along with a generic implementation for any type of visualization and data, and provide a particular

implementation for point-based visualization of multivariate data. I argue that the pattern provides a useful

user experience by elevating the cognitive level of interaction with data from pure perception of visual

representations to understanding higher level semantics of data. As such, this supports users in building faster

qualitative mental models and accelerating discovery. Furthermore, facilitating insight opens new research

opportunities such as visual-analytic action recommendations, improved collaboration, and accessibility.

ETPL

DM - 019

Just-in-time interactive analytics: Guiding visual exploration of data

In this paper, we propose a detection method based on data-driven target modeling, which implicitly handles

variations in the target appearance. Given a training set of images of the target, our approach constructs models

based on local neighborhoods within the training set. We present a new metric using these models and show

that, by controlling the notion of locality within the training set, this metric is invariant to perturbations in the

appearance of the target. Using this metric in a supervised graph framework, we construct a low-dimensional

embedding of test images. Then, a detection score based on the embedding determines the presence of a target

in each image. The method is applied to a data set of side-scan sonar images and achieves impressive results in

the detection of sea mines. The proposed framework is general and can be applied to different target detection

problems in a broad range of signals.

ETPL

DM - 020

Graph-Based Supervised Automatic Target Detection


Software companies spend over 45 percent of their costs on dealing with software bugs. An inevitable step of fixing

bugs is bug triage, which aims to correctly assign a developer to a new bug. To decrease the time cost in

manual work, text classification techniques are applied to conduct automatic bug triage. In this paper, we

address the problem of data reduction for bug triage, i.e., how to reduce the scale and improve the quality of

bug data. We combine instance selection with feature selection to simultaneously reduce data scale on the bug

dimension and the word dimension. To determine the order of applying instance selection and feature

selection, we extract attributes from historical bug data sets and build a predictive model for a new bug data

set. We empirically investigate the performance of data reduction on a total of 600,000 bug reports of two large

open source projects, namely Eclipse and Mozilla. The results show that our data reduction can effectively

reduce the data scale and improve the accuracy of bug triage. Our work provides an approach to leveraging

techniques on data processing to form reduced and high-quality bug data in software development and

maintenance.

ETPL

DM - 021

Towards Effective Bug Triage with Software Data Reduction Techniques

Intelligently extracting knowledge from social media has recently attracted great interest from the Biomedical

and Health Informatics community to simultaneously improve healthcare outcomes and reduce costs using

consumer-generated opinion. We propose a two-step analysis framework that focuses on positive and negative

sentiment, as well as the side effects of treatment, in users' forum posts, and identifies user communities

(modules) and influential users for the purpose of ascertaining user opinion of cancer treatment. We used a

self-organizing map to analyze word frequency data derived from users' forum posts. We then introduced a

novel network-based approach for modeling users' forum interactions and employed a network partitioning

method based on optimizing a stability quality measure. This allowed us to determine consumer opinion and

identify influential users within the retrieved modules using information derived from both word-frequency

data and network-based properties. Our approach can expand research into intelligently mining social media

data for consumer opinion of various treatments to provide rapid, up-to-date information for the

pharmaceutical industry, hospitals, and medical staff, on the effectiveness (or ineffectiveness) of future

treatments.

ETPL

DM - 022

Network-Based Modeling and Intelligent Data Mining of Social Media for

Improving Care


We propose a novel approach for detecting precursors to epileptic seizures in intracranial

electroencephalograms (iEEGs), which is based on the analysis of system dynamics. In the proposed scheme,

the largest Lyapunov exponents (LLEs) of the wavelet entropy of the segmented EEG signals are considered as the

discriminating features. Such features are processed by a support vector machine classifier, whose outcomes

(the label and its probability for each LLE) are post-processed and fed into a novel decision function to

determine whether the corresponding segment of the EEG signal contains a precursor to an epileptic seizure.

The proposed scheme is applied to the Freiburg data set, and the results show that seizure precursors are

detected in a time frame that, unlike other existing schemes, is convenient for patients, with a

sensitivity of 100% and negligible false positive detection rates.

ETPL

DM - 023

Real-time mining of epileptic seizure precursors via nonlinear mapping and

dissimilarity features

Data Mining has wide applications in many areas such as banking, medicine, scientific research and among

government agencies. Classification is one of the commonly used tasks in data mining applications. For the past

decade, due to the rise of various privacy issues, many theoretical and practical solutions to the classification

problem have been proposed under different security models. However, with the recent popularity of cloud

computing, users now have the opportunity to outsource their data, in encrypted form, as well as the data mining

tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-preserving classification

techniques are not applicable. In this paper, we focus on solving the classification problem over encrypted data.

In particular, we propose a secure k-NN classifier over encrypted data in the cloud. The proposed protocol

protects the confidentiality of data, privacy of user's input query, and hides the data access patterns. To the best

of our knowledge, our work is the first to develop a secure k-NN classifier over encrypted data under the semi-

honest model. Also, we empirically analyze the efficiency of our proposed protocol using a real-world dataset

under different parameter settings.

ETPL

DM - 024

k-Nearest Neighbor Classification over Semantically Secure Encrypted Relational

Data


As a newly emerging network model, heterogeneous information networks (HINs) have received growing

attention. Many data mining tasks have been explored in HINs, including clustering, classification, and

similarity search. Similarity join is a fundamental operation required for many problems. It is attracting

attention from various applications on network data, such as friend recommendation, link prediction, and

online advertising. Although similarity join has been well studied in homogeneous networks, it has not yet

been studied in heterogeneous networks. Especially, none of the existing research on similarity join takes

different semantic meanings behind paths into consideration and almost all completely ignore the

heterogeneity and diversity of the HINs. In this paper, we propose a path-based similarity join (PS-join)

method to return the top k similar pairs of objects based on any user specified join path in a heterogeneous

information network. We study how to prune expensive similarity computation by introducing bucket pruning

based locality sensitive hashing (BPLSH) indexing. Compared with the existing Link-based Similarity join (LS-

join) method, PS-join can derive various similarity semantics. Experimental results on real data sets show the

efficiency and effectiveness of the proposed approach.

ETPL

DM - 025

Top-k Similarity Join in Heterogeneous Information Networks

Ranking of association rules is currently an interesting topic in data mining and bioinformatics. The huge

number of rules over items (or genes) generated by association rule mining (ARM) algorithms confuses

the decision maker. In this article, we propose a weighted rule-mining technique (say, RANWAR or rank-

based weighted association rule-mining) to rank the rules using two novel rule-interestingness measures, viz.,

rank-based weighted condensed support (wcs) and weighted condensed confidence (wcc) measures to bypass

the problem. These measures basically depend on the rank of items (genes). Using the rank, we assign

weight to each item. RANWAR generates far fewer frequent itemsets than the state-of-the-art

association rule mining algorithms. Thus, it reduces the execution time of the algorithm. We run RANWAR on

gene expression and methylation datasets. The genes of the top rules are biologically validated by Gene

Ontologies (GOs) and KEGG pathway analyses. Many top-ranked rules extracted by RANWAR that hold

poor ranks in traditional Apriori are highly biologically significant to the related diseases. Finally, the top

rules evolved from RANWAR, that are not in Apriori, are reported.

ETPL

DM - 026

RANWAR: Rank-Based Weighted Association Rule Mining From Gene

Expression and Methylation Data
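
The abstract above does not give the exact wcs/wcc formulas, so the sketch below uses one plausible rank-derived weighting (weight = 1/rank) and a weighted support that scales ordinary support by the mean weight of an itemset's items; it is an illustration of the rank-weighting idea only, not RANWAR's actual measures.

```python
# Illustrative rank-based weighted support over toy gene-set "transactions".
def item_weights(ranked_items):
    """ranked_items: genes ordered from most to least important (rank 1 is best)."""
    return {g: 1.0 / (rank + 1) for rank, g in enumerate(ranked_items)}

def weighted_support(itemset, transactions, weights):
    support = sum(itemset <= t for t in transactions) / len(transactions)
    mean_w = sum(weights[g] for g in itemset) / len(itemset)
    return support * mean_w

transactions = [{"g1", "g2", "g3"}, {"g1", "g3"}, {"g2", "g3", "g4"}, {"g1", "g2"}]
weights = item_weights(["g3", "g1", "g2", "g4"])   # g3 ranked most important
print(weighted_support({"g1", "g3"}, transactions, weights))
print(weighted_support({"g2", "g4"}, transactions, weights))
```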


In this paper, a new method for constructing decision trees for stream data is proposed. First a new splitting

criterion based on the misclassification error is derived. A theorem is proven showing that the best attribute

computed in the considered node according to the available data sample is the same, with high probability,

as the attribute derived from the whole infinite data stream. Next this result is combined with the splitting

criterion based on the Gini index. It is shown that such combination provides the highest accuracy among all

studied algorithms.

ETPL

DM - 027

A New Method for Data Stream Mining Based on the Misclassification Error
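
The sketch below shows the generic stream-mining split check that results of this kind are used for: the current best attribute is accepted only when the observed gap between the top two criterion values exceeds a Hoeffding bound. Gini impurity is used here as the example criterion; the paper's misclassification-error criterion is only summarised in the abstract.

```python
# Hoeffding-bound split decision on streamed class counts (toy numbers).
import math

def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the sample mean is within this bound of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def gini_gain(class_counts_per_branch, total_counts):
    def gini(counts):
        n = sum(counts)
        return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0
    n_total = sum(total_counts)
    weighted = sum(sum(b) / n_total * gini(b) for b in class_counts_per_branch)
    return gini(total_counts) - weighted

# Criterion values for two candidate attributes computed from n streamed examples.
n, delta = 500, 1e-6
gain_a = gini_gain([[180, 20], [30, 270]], [210, 290])
gain_b = gini_gain([[120, 90], [90, 200]], [210, 290])
eps = hoeffding_bound(value_range=1.0, delta=delta, n=n)
print(gain_a, gain_b, eps)
print("split now" if gain_a - gain_b > eps else "wait for more data")
```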

Learning to rank arises in many data mining applications, ranging from web search engine, online advertising

to recommendation system. In learning to rank, the performance of a ranking model is strongly affected by the

number of labeled examples in the training set; on the other hand, obtaining labeled examples for training data

is very expensive and time-consuming. This presents a great need for the active learning approaches to select

most informative examples for ranking learning; however, in the literature there is still very limited work to

address active learning for ranking. In this paper, we propose a general active learning framework, expected

loss optimization (ELO), for ranking. The ELO framework is applicable to a wide range of ranking functions.

Under this framework, we derive a novel algorithm, expected discounted cumulative gain (DCG) loss

optimization (ELO-DCG), to select most informative examples. Then, we investigate both query and

document level active learning for ranking and propose a two-stage ELO-DCG algorithm which incorporates

both query and document selection into active learning. Furthermore, we show that it is flexible for the

algorithm to deal with the skewed grade distribution problem with the modification of the loss function.

Extensive experiments on real-world web search data sets have demonstrated great potential and effectiveness

of the proposed framework and algorithms.

ETPL

DM - 028

Active Learning for Ranking through Expected Loss Optimization


It is a big challenge to guarantee the quality of discovered relevance features in text documents for describing

user preferences because of large scale terms and data patterns. Most existing popular text mining and

classification methods have adopted term-based approaches. However, they have all suffered from the

problems of polysemy and synonymy. Over the years, the hypothesis has often been held that pattern-

based methods should perform better than term-based ones in describing user preferences; yet, how to

effectively use large scale patterns remains a hard problem in text mining. To make a breakthrough in this

challenging issue, this paper presents an innovative model for relevance feature discovery. It discovers both

positive and negative patterns in text documents as higher level features and deploys them over low-level

features (terms). It also classifies terms into categories and updates term weights based on their specificity and

their distributions in patterns. Substantial experiments using this model on RCV1, TREC topics and Reuters-

21578 show that the proposed model significantly outperforms both the state-of-the-art term-based methods

and the pattern-based methods.

ETPL

DM - 030

Relevance Feature Discovery for Text Mining

The classical fuzzy system modeling methods implicitly assume data generated from a single task, which is

essentially not in accordance with many practical scenarios where data can be acquired from the perspective of

multiple tasks. Although one can build an individual fuzzy system model for each task, the result indeed tells

us that the individual modeling approach will get poor generalization ability due to ignoring the intertask

hidden correlation. In order to circumvent this shortcoming, we consider a general framework for preserving

the independent information among different tasks and mining hidden correlation information among all tasks

in multitask fuzzy modeling. In this framework, a low-dimensional subspace (structure) is assumed to be

shared among all tasks and hence be the hidden correlation information among all tasks. Under this

framework, a multitask Takagi-Sugeno-Kang (TSK) fuzzy system model called MTCS-TSK-FS (TSK-FS for

multiple tasks with common hidden structure), based on the classical L2-norm TSK fuzzy system, is proposed

in this paper. The proposed model can not only take advantage of independent sample information from the

original space for each task, but also effectively use the intertask common hidden structure among multiple

tasks to enhance the generalization performance of the built fuzzy systems. Experiments on synthetic and real-

world datasets demonstrate the applicability and distinctive performance of the proposed multitask fuzzy

system model in multitask regression learning scenarios.

ETPL

DM - 031

Multitask TSK Fuzzy System Modeling by Mining Intertask Common Hidden

Structure


The problem of mobile sequential recommendation is to suggest a route connecting a set of pick-up points for

a taxi driver so that he/she is more likely to get passengers with less travel cost. Essentially, a key challenge of

this problem is its high computational complexity. In this paper, we propose a novel dynamic programming

based method to solve the mobile sequential recommendation problem consisting of two separate stages: an

offline pre-processing stage and an online search stage. The offline stage pre-computes potential candidate

sequences from a set of pick-up points. A backward incremental sequence generation algorithm is proposed

based on the identified iterative property of the cost function. Simultaneously, an incremental pruning policy is

adopted in the process of sequence generation to reduce the search space of the potential sequences effectively.

In addition, a batch pruning algorithm is further applied to the generated potential sequences to remove some

non-optimal sequences of a given length. Since the pruning effectiveness keeps growing with the increase of

the sequence length, at the online stage, our method can efficiently find the optimal driving route for an

unloaded taxi in the remaining candidate sequences. Moreover, our method can handle the problem of optimal

route search with a maximum cruising distance or a destination constraint. Experimental results on real and

synthetic data sets show that both the pruning ability and the efficiency of our method surpass the state-of-the-

art methods. Our techniques can therefore be effectively employed to address the problem of mobile sequential

recommendation with many pick-up points in real-world applications.

ETPL

DM - 032

Backward Path Growth for Efficient Mobile Sequential Recommendation

Feature selection has been an important research topic in data mining, because the real data sets often have

high-dimensional features, such as in bioinformatics and text mining applications. Many existing filter feature

selection methods rank features by optimizing certain feature ranking criteria, such that correlated features

often have similar rankings. These correlated features are redundant and don’t provide large mutual

information to help data mining. Thus, when we select a limited number of features, we hope to select the top

non-redundant features such that the useful mutual information can be maximized. In previous research, Ding

et al. recognized this important issue and proposed the mRMR (minimum Redundancy Maximum Relevance

Feature Selection) model to minimize the redundancy between sequentially selected features. However, this

method used the greedy search, thus the global feature redundancy wasn’t considered and the results are not

optimal. In this paper, we propose a new feature selection framework to globally minimize the feature

redundancy while maximizing the given feature ranking scores, which can come from any supervised or

unsupervised methods. Our new model has no parameters, so it is especially suitable for practical data mining applications.

ETPL

DM - 033

Global Redundancy Minimization for Feature Ranking


Data Mining has wide applications in many areas such as banking, medicine, scientific research and among

government agencies. Classification is one of the commonly used tasks in data mining applications. For the

past decade, due to the rise of various privacy issues, many theoretical and practical solutions to the

classification problem have been proposed under different security models. However, with the recent

popularity of cloud computing, users now have the opportunity to outsource their data, in encrypted form, as

well as the data mining tasks to the cloud. Since the data on the cloud is in encrypted form, existing privacy-

preserving classification techniques are not applicable. In this paper, we focus on solving the classification

problem over encrypted data. In particular, we propose a secure k-NN classifier over encrypted data in the

cloud. The proposed protocol protects the confidentiality of data, privacy of user's input query, and hides the

data access patterns. To the best of our knowledge, our work is the first to develop a secure k-NN classifier

over encrypted data under the semi-honest model. Also, we empirically analyze the efficiency of our proposed

protocol using a real-world dataset under different parameter settings.

ETPL

DM - 035

Secure k-NN Classification over Semantically Secure Encrypted Relational

Data

Recently, two ideas have been explored that lead to more accurate algorithms for time-series classification

(TSC). First, it has been shown that the simplest way to gain improvement on TSC problems is to transform

the data into an alternative data space where discriminatory features are more easily detected. Second, it was

demonstrated that with a single data representation, improved accuracy can be achieved through simple

ensemble schemes. We combine these two principles to test the hypothesis that forming a collective of

ensembles of classifiers on different data transformations improves the accuracy of time-series classification.

The collective contains classifiers constructed in the time, frequency, change, and shapelet transformation

domains. For the time domain we use a set of elastic distance measures. For the other domains we use a range

of standard classifiers. Through extensive experimentation on 72 datasets, including all of the 46 UCR

datasets, we demonstrate that the simple collective formed by including all classifiers in one ensemble is

significantly more accurate than any of its components and any other previously published TSC algorithm. We

investigate alternative hierarchical collective structures and demonstrate the utility of the approach on a new

problem involving classifying Caenorhabditis elegans mutant types.

ETPL

DM - 036

Time-Series Classification with COTE: The Collective of Transformation-Based

Ensembles
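
A minimal sketch of the "collective of transforms" idea from the entry above is given below: 1-NN classifiers built on the raw series and on its FFT-magnitude transform vote together. COTE's actual constituents (elastic distances, shapelets, change features, weighted voting) are not reproduced, and the series are synthetic.

```python
# Two 1-NN classifiers on different representations of a time series vote on the label.
import numpy as np
from collections import Counter

def one_nn(train_X, train_y, x):
    return train_y[int(np.argmin(np.linalg.norm(train_X - x, axis=1)))]

rng = np.random.default_rng(1)
t = np.linspace(0, 1, 64)
def make(freq):  # noisy sine of a given frequency as a toy time series
    return np.sin(2 * np.pi * freq * t) + 0.3 * rng.normal(size=t.size)

train_X = np.array([make(2) for _ in range(10)] + [make(5) for _ in range(10)])
train_y = np.array([0] * 10 + [1] * 10)
test_x = make(5)

transforms = {
    "time": lambda s: s,
    "frequency": lambda s: np.abs(np.fft.rfft(s)),
}
votes = [
    one_nn(np.apply_along_axis(f, 1, train_X), train_y, f(test_x))
    for f in transforms.values()
]
print(votes, "->", Counter(votes).most_common(1)[0][0])   # ensemble prediction
```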


Recommender systems are promising for providing personalized favorite services. Collaborative filtering (CF)

technologies, making prediction of users’ preference based on users’ previous behaviors, have become one of

the most successful techniques to build modern recommender systems. Several challenging issues occur in

previously proposed CF methods: 1) most CF methods ignore users’ response patterns and may yield biased

parameter estimation and suboptimal performance; 2) some CF methods adopt heuristic weight settings, which

lacks a systematical implementation; 3) the multinomial mixture models may weaken the computational

ability of matrix factorization for generating the data matrix, thus increasing the computational cost of

training. To resolve these issues, we incorporate users’ response models into the probabilistic matrix

factorization (PMF), a popular matrix factorization CF model, to establish the Response Aware Probabilistic

Matrix Factorization (RAPMF) framework. More specifically, we make the assumption on the user response

as a Bernoulli distribution which is parameterized by the rating scores for the observed ratings while as a step

function for the unobserved ratings. Moreover, we speed up the algorithm by a mini-batch implementation and

a carefully crafted scheduling policy. Finally, we design different experimental protocols and conduct a systematic

empirical evaluation on both synthetic and real-world datasets to demonstrate the merits of the proposed

RAPMF and its mini-batch implementation.

ETPL

DM - 037

Boosting Response Aware Model-Based Collaborative Filtering

Over the past decade or so, several research groups have addressed the problem of multi-label classification

where each example can belong to more than one class at the same time. A common approach, called Binary

Relevance (BR), addresses this problem by inducing a separate classifier for each class. Research has shown

that this framework can be improved if mutual class dependence is exploited: an example that belongs to class

X is likely to belong also to class Y; conversely, belonging to X can make an example less likely to belong to

Z. Several works sought to model this information by using the vector of class labels as additional example

attributes. To fill the unknown values of these attributes during prediction, existing methods resort to using

outputs of other classifiers, and this makes them prone to errors. This is where our paper wants to contribute.

We identified two potential ways to prune unnecessary dependencies and to reduce error-propagation in our

new classifier-stacking technique, which is named PruDent. Experimental results indicate that the

classification performance of PruDent compares favorably with that of other state-of-the-art approaches over a

broad range of testbeds. Moreover, its computational costs grow only linearly in the number of classes.

ETPL

DM - 038

PruDent: A Pruned and Confident Stacking Approach for Multi-label

Classification


Recently, there has been a growing interest in designing differentially private data mining algorithms.

Frequent itemset mining (FIM) is one of the most fundamental problems in data mining. In this paper, we

explore the possibility of designing a differentially private FIM algorithm which can not only achieve high

data utility and a high degree of privacy, but also offer high time efficiency. To this end, we propose a

differentially private FIM algorithm based on the FP-growth algorithm, which is referred to as PFP-growth.

The PFP-growth algorithm consists of a preprocessing phase and a mining phase. In the preprocessing phase,

to improve the utility and privacy tradeoff, a novel smart splitting method is proposed to transform the

database. For a given database, the preprocessing phase needs to be performed only once. In the mining phase,

to offset the information loss caused by transaction splitting, we devise a run-time estimation method to

estimate the actual support of itemsets in the original database. In addition, by leveraging the downward

closure property, we put forward a dynamic reduction method to dynamically reduce the amount of noise

added to guarantee privacy during the mining process. Through formal privacy analysis, we show that our

PFP-growth algorithm is ε-differentially private. Extensive experiments on real datasets illustrate that our PFP-

growth algorithm substantially outperforms the state-of-the-art techniques.

ETPL

DM - 039

Differentially Private Frequent Itemset Mining via Transaction Splitting
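
Only the basic Laplace mechanism that differentially private counting relies on is sketched below; PFP-growth's smart splitting, run-time support estimation, and dynamic noise reduction described in the entry above are not reproduced, and the itemset counts are hypothetical.

```python
# Releasing itemset support counts with Laplace noise calibrated to a privacy budget.
import numpy as np

def noisy_support(true_count, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale sensitivity/epsilon to a support count."""
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(0.0, sensitivity / epsilon)

rng = np.random.default_rng(7)
true_supports = {("bread",): 420, ("bread", "butter"): 187, ("milk",): 333}
epsilon = 0.5    # privacy budget for this release (smaller -> more noise)
for itemset, count in true_supports.items():
    print(itemset, round(noisy_support(count, epsilon, rng=rng), 1))
```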

Given a spatio-temporal network, a source, a destination, and a desired departure time interval, the All-

departure-time Lagrangian Shortest Paths (ALSP) problem determines a set which includes the shortest path for

every departure time in the given interval. ALSP is important for critical societal applications such as eco-

routing. However, ALSP is computationally challenging due to the non-stationary ranking of the candidate

paths across distinct departure-times. Current related work for reducing the redundant work, across consecutive

departure-times sharing a common solution, exploits only partial information e.g., the earliest feasible arrival

time of a path. In contrast, our approach uses all available information, e.g., the entire time series of arrival

times for all departure-times. This allows elimination of all knowable redundant computation based on complete

information available at hand. We operationalize this idea through the concept of critical-time-points (CTP),

i.e., departure-times before which ranking among candidate paths cannot change. In our preliminary work, we

proposed a CTP based forward search strategy. In this paper we propose a CTP based temporal bi-directional

search for the ALSP problem via a novel impromptu rendezvous termination condition. Theoretical and

experimental analysis show that the proposed approach outperforms the related work approaches particularly

when there are few critical-time-points.

ETPL

DM - 040

A Critical-time-point Approach to All-departure-time Lagrangian Shortest

Paths
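
The core primitive that ALSP repeats across departure times is a time-dependent shortest-path search for a single fixed departure time. The sketch below shows that primitive as a plain Dijkstra search over FIFO travel-time functions; it is not the paper's critical-time-point or bi-directional technique, and the tiny network with its lambda travel times is hypothetical.

import heapq

def td_dijkstra(graph, source, target, depart):
    # graph[u] is a list of (v, travel_time_fn), where travel_time_fn(t) gives the
    # edge traversal time when entering the edge at time t (FIFO network assumed).
    best = {source: depart}
    pq = [(depart, source)]
    while pq:
        t, u = heapq.heappop(pq)
        if u == target:
            return t - depart  # total travel time for this departure time
        if t > best.get(u, float("inf")):
            continue
        for v, travel_time in graph.get(u, []):
            arrival = t + travel_time(t)
            if arrival < best.get(v, float("inf")):
                best[v] = arrival
                heapq.heappush(pq, (arrival, v))
    return None

# Hypothetical network: edge costs vary with the departure time (e.g., congestion).
g = {
    "A": [("B", lambda t: 5 if t < 10 else 2), ("C", lambda t: 4)],
    "B": [("D", lambda t: 1)],
    "C": [("D", lambda t: 6)],
}
for t0 in (0, 12):
    print(t0, td_dijkstra(g, "A", "D", t0))

A naive ALSP solution would rerun this search for every departure time in the interval; the critical-time-point idea is precisely about skipping reruns whose path ranking cannot change.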

Elysium Technologies Private Limited Singapore | Madurai | Chennai | Trichy | Ramnad Erode | Tirunelveli|

Sivakasi |Dindugul|

http://www.elysiumtechnologies.com, [email protected]

Numerical typing errors can lead to serious consequences, but various causes of human errors and the lack of

contextual clues in numerical typing make their prediction difficult. Human behavior modeling can predict

the general tendency in making errors, while data mining can recognize neurophysiological feedback in

detecting cognitive abnormality on a trial-by-trial basis. This study suggests integrating human behavior

modeling and data mining to predict human errors because it utilizes both 1) top-down inference to transform

interactions between task characteristics and conditions into a general inclination of an average operator to

make errors and 2) bottom-up analysis in parsing psychophysiological measurements into an individual's

likelihood of making errors on a trial-by-trial basis. Real-time electroencephalograph (EEG) features

collected in a numerical typing experiment and modeling features produced by an enhanced human behavior

model (queuing network model human processor) were combined to improve error classification

performance by a linear discriminant analysis (LDA) classifier. Integrating EEG and modeling features

improved the results of LDA classification by 28.3% in sensitivity (d′) and by 10.7% in the area under the ROC

curve (AUC) relative to using EEG only; it also outperformed the other three benchmarking scenarios:

using behaviors only, using apparent task features, and using task features plus trial information. The AUC

increased significantly over using EEG alone only when EEG + Model features were used.

ETPL

DM - 041 Integrating Human Behavior Modeling and Data Mining Techniques to Predict

Human Errors in Numerical Typing
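
The classification step described above is standard linear discriminant analysis over a concatenation of EEG features and behavior-model features. A minimal scikit-learn sketch follows; the synthetic feature matrices stand in for the real per-trial EEG and queuing-network-model features, so the printed AUC values are illustrative only.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
eeg = rng.normal(size=(n, 8))          # stand-in for per-trial EEG features
model_feats = rng.normal(size=(n, 3))  # stand-in for behavior-model features
y = (eeg[:, 0] + 0.5 * model_feats[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

lda = LinearDiscriminantAnalysis()
for name, X in (("EEG only", eeg), ("EEG + model", np.hstack([eeg, model_feats]))):
    auc = cross_val_score(lda, X, y, cv=5, scoring="roc_auc").mean()
    print(name, round(auc, 3))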

Intelligently extracting knowledge from social media has recently attracted great interest from the

Biomedical and Health Informatics community to simultaneously improve healthcare outcomes and reduce

costs using consumer-generated opinion. We propose a two-step analysis framework that focuses on positive

and negative sentiment, as well as the side effects of treatment, in users' forum posts, and identifies user

communities (modules) and influential users for the purpose of ascertaining user opinion of cancer treatment.

We used a self-organizing map to analyze word frequency data derived from users' forum posts. We then

introduced a novel network-based approach for modeling users' forum interactions and employed a network

partitioning method based on optimizing a stability quality measure. This allowed us to determine consumer

opinion and identify influential users within the retrieved modules using information derived from both

word-frequency data and network-based properties. Our approach can expand research into intelligently

mining social media data for consumer opinion of various treatments to provide rapid, up-to-date information

for the pharmaceutical industry, hospitals, and medical staff, on the effectiveness (or ineffectiveness) of

future treatments.

ETPL

DM - 042 Network-Based Modeling and Intelligent Data Mining of Social Media for

Improving Care
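
A rough sketch of the network side of this framework: build a weighted graph of user interactions, partition it into modules, and pick the most central user of each module as a candidate influential user. NetworkX's modularity-based community detection stands in here for the paper's stability-optimization partitioning, and the edge list is hypothetical.

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical reply/interaction edges between forum users (weights = number of interactions).
edges = [("u1", "u2", 5), ("u1", "u3", 2), ("u2", "u3", 4),
         ("u4", "u5", 6), ("u5", "u6", 3), ("u4", "u6", 2), ("u3", "u4", 1)]
G = nx.Graph()
G.add_weighted_edges_from(edges)

# Partition users into modules (modularity is a stand-in for the stability measure).
modules = greedy_modularity_communities(G, weight="weight")
centrality = nx.degree_centrality(G)
for i, module in enumerate(modules):
    influential = max(module, key=centrality.get)
    print("module", i, sorted(module), "most central user:", influential)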


A novel data mining method was developed to gauge the experience of the drug Sitagliptin (trade name

Januvia) by patients with diabetes mellitus type 2. To this end, we devised a two-step analysis framework.

Initial exploratory analysis using self-organizing maps was performed to determine structures based on user

opinions among the forum posts. The results were a compilation of user clusters and their correlated (positive

or negative) opinions of the drug. Subsequent modeling using network analysis methods was used to determine

influential users among the forum members. These findings can open new avenues of research into rapid data

collection, feedback, and analysis that can enable improved outcomes and solutions for public health and

important feedback for the manufacturer.

ETPL

DM - 043 A Novel Data-Mining Approach Leveraging Social Media to Monitor Consumer

Opinion of Sitagliptin

The High Efficiency Video Coding standard provides an improved compression ratio in comparison with its

predecessors at the cost of large increases in the encoding computational complexity. An important share of

this increase is due to the new flexible partitioning structures, namely the coding trees, the prediction units,

and the residual quadtrees, with the best configurations decided through an exhaustive rate-distortion

optimization (RDO) process. In this paper, we propose a set of procedures for deciding whether the partition

structure optimization algorithm should be terminated early or run to the end of an exhaustive search for the

best configuration. The proposed schemes are based on decision trees obtained through data mining

techniques. By extracting intermediate data, such as encoding variables from a training set of video sequences,

three sets of decision trees are built and implemented to avoid running the RDO algorithm to its full extent.

When separately implemented, these schemes achieve average computational complexity reductions (CCRs)

of up to 50% at a negligible cost of 0.56% in terms of Bjontegaard Delta (BD) rate increase. When the

schemes are jointly implemented, an average CCR of up to 65% is achieved, with a small BD-rate increase of

1.36%. Extensive experiments and comparisons with similar works demonstrate that the proposed early

termination schemes achieve the best rate-distortion-complexity tradeoffs among all the compared works.

ETPL

DM - 044

Fast HEVC Encoding Decisions Using Data Mining
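
The early-termination idea can be sketched as an offline-trained decision tree that is consulted during encoding: if the tree is confident that further partitioning will not pay off, the remaining RDO search is skipped. The feature names, synthetic training data, and confidence threshold below are hypothetical; they only illustrate the mechanism, not the paper's exact trees.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
n = 1000
# Hypothetical intermediate encoder variables collected from a training set:
# [RD cost of the current CU, neighbour split ratio, motion vector variance]
X = rng.normal(size=(n, 3))
# Label 1 = "stop splitting early"; derived synthetically here for illustration.
y = (X[:, 0] < 0.0).astype(int)

tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20).fit(X, y)

def early_terminate(features, threshold=0.9):
    # Skip the remaining RDO search only when the tree is confident enough.
    return tree.predict_proba([features])[0][1] >= threshold

print(early_terminate([-1.2, 0.3, 0.1]))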


Wind energy integration research generally relies on complex sensors located at remote sites. The procedure

for generating high-level synthetic information from databases containing large amounts of low-level data

must therefore account for possible sensor failures and imperfect input data, because the generated information is highly sensitive

to data quality. To address this problem, this paper presents an empirical methodology that can efficiently

preprocess and filter the raw wind data using only aggregated active power output and the corresponding wind

speed values at the wind farm. First, raw wind data properties are analyzed, and all the data are divided into

six categories according to their attribute magnitudes from a statistical perspective. Next, the weighted

distance, a novel measure of the degree of similarity between individual objects in the wind database, is

incorporated into the local outlier factor (LOF) algorithm to compute the outlier factor of every individual

object, and this outlier factor is then used to assess which category an object belongs to. Finally, the

methodology was tested successfully on the data collected from a large wind farm in northwest China.

ETPL

DM - 045

Raw Wind Data Preprocessing: A Data-Mining Approach
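
The outlier-scoring step can be illustrated with the standard local outlier factor from scikit-learn applied to (wind speed, active power) pairs; the synthetic data and plain Euclidean distance below stand in for the paper's real SCADA records and weighted-distance variant.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(2)
wind = rng.uniform(3, 12, 500)                       # hypothetical wind speeds (m/s)
power = 2.0 * wind ** 3 + rng.normal(0, 50, 500)     # plausible cubic power curve (kW)
normal = np.column_stack([wind, power])
stuck = np.column_stack([rng.uniform(6, 10, 20), np.zeros(20)])  # power stuck at zero
data = np.vstack([normal, stuck])

lof = LocalOutlierFactor(n_neighbors=25)   # plain Euclidean distance here, not the
labels = lof.fit_predict(data)             # paper's weighted distance
scores = -lof.negative_outlier_factor_     # larger value = more anomalous
print("flagged as outliers:", int((labels == -1).sum()))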

This work deals with the problem of producing fast and accurate data classification, learning it from a

possibly small set of records that are already classified. The proposed approach is based on the framework of

the so-called Logical Analysis of Data (LAD), but enriched with information obtained from statistical

considerations on the data. A number of discrete optimization problems are solved in the different steps of the

procedure, but their computational demand can be controlled. The accuracy of the proposed approach is

compared to that of the standard LAD algorithm, of Support Vector Machines, and of the Label Propagation

algorithm on publicly available datasets from the UCI repository. Encouraging results are obtained and discussed.

ETPL

DM - 046

Effective Classification using a small Training Set based on Discretization and

Statistical Analysis

E-healthcare systems have been increasingly facilitating health condition monitoring, disease modeling and

early intervention, and evidence-based medical treatment by medical text mining and image feature extraction.

Owing to the resource constraints of wearable mobile devices, it is necessary to outsource the frequently

collected personal health information (PHI) to the cloud. Unfortunately, delegating both storage and

computation to the untrusted entity would bring a series of security and privacy issues. The existing work

mainly focused on fine-grained privacy-preserving static medical text access and analysis, which can hardly

accommodate dynamic health condition fluctuations or medical image analysis. In this paper, a secure and

privacy-preserving protocol for dynamic medical text mining and image feature extraction (PPDM) in cloud-assisted e-healthcare systems is proposed.

ETPL

DM - 047

PPDM: Privacy-preserving Protocol for Dynamic Medical Text Mining and Image

Feature Extraction from Secure Data Aggregation in Cloud-assisted e-Healthcare

Systems


Currently, most computer systems use user IDs and passwords as the login patterns to authenticate users.

However, many people share their login patterns with coworkers and ask these coworkers to assist with co-

tasks, thereby making the pattern one of the weakest points of computer security. Insider attackers, the valid

users of a system who attack the system internally, are hard to detect since most intrusion detection systems

and firewalls identify and isolate malicious behaviors launched from the outside world of the system only. In

addition, some studies have claimed that analyzing system calls (SCs) generated by commands can identify these

commands and thereby detect attacks accurately, with attack patterns serving as the features of an attack. Therefore,

in this paper, a security system, named the Internal Intrusion Detection and Protection System (IIDPS), is

proposed to detect insider attacks at SC level by using data mining and forensic techniques. The IIDPS creates

users' personal profiles to keep track of users' usage habits as their forensic features and determines whether a

valid login user is the account holder or not by comparing his/her current computer usage behaviors with the

patterns collected in the account holder's personal profile. The experimental results demonstrate that the

IIDPS's user identification accuracy is 94.29%, while the response time is less than 0.45 s, implying that it

can protect a system from insider attacks effectively and efficiently.

ETPL

DM - 048

An Internal Intrusion Detection and Protection System by Using Data Mining

and Forensic Techniques
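
A minimal sketch of the profiling idea at the system-call level: summarize a user's habitual behaviour as a bag of SC n-grams and compare the current session against the stored profile with a cosine similarity; a low score would flag a possible insider. The traces, n-gram length, and similarity measure are illustrative assumptions, not the IIDPS's actual forensic features.

from collections import Counter
from math import sqrt

def sc_ngrams(calls, n=3):
    # Represent a user's behaviour as a bag of system-call n-grams.
    return Counter(tuple(calls[i:i + n]) for i in range(len(calls) - n + 1))

def cosine(p, q):
    common = set(p) & set(q)
    num = sum(p[g] * q[g] for g in common)
    den = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return num / den if den else 0.0

# Hypothetical traces: the account holder's stored profile vs. the current session.
profile = sc_ngrams(["open", "read", "read", "close", "open", "read", "close"])
session = sc_ngrams(["open", "write", "unlink", "open", "write", "unlink", "close"])
print("similarity to profile:", round(cosine(profile, session), 3))
# A low similarity to the stored profile would flag a possible insider.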

In this paper, a multiple classifier machine learning (ML) methodology for predictive maintenance (PdM) is

presented. PdM is a prominent strategy for dealing with maintenance issues given the increasing need to

minimize downtime and associated costs. One of the challenges with PdM is generating the so-called “health

factors,” or quantitative indicators, of the status of a system associated with a given maintenance issue, and

determining their relationship to operating costs and failure risk. The proposed PdM methodology allows

dynamical decision rules to be adopted for maintenance management, and can be used with high-dimensional

and censored data problems. This is achieved by training multiple classification modules with different

prediction horizons to provide different performance tradeoffs in terms of frequency of unexpected breaks and

unexploited lifetime, and then employing this information in an operating cost-based maintenance decision

system to minimize expected costs. The effectiveness of the methodology is demonstrated using a simulated

example and a benchmark semiconductor manufacturing maintenance problem.

ETPL

DM- 049

Machine Learning for Predictive Maintenance: A Multiple Classifier Approach
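
One way to picture the multiple-classifier scheme: train one classifier per prediction horizon to answer "will the equipment fail within h cycles?", then turn the predicted risks into a maintenance decision through a simple cost threshold. The random forests, synthetic health factors, and newsvendor-style threshold below are stand-ins for the paper's modules and operating-cost decision system.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 5))                                    # stand-in health factors
time_to_failure = rng.exponential(50, n) * np.exp(-X[:, 0])    # synthetic ground truth

horizons = [10, 25, 50]
models = {}
for h in horizons:
    y = (time_to_failure <= h).astype(int)                     # label: fails within h cycles
    models[h] = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def maintenance_decision(x, cost_break=100.0, cost_early=5.0):
    # Maintain at the shortest horizon whose predicted failure risk exceeds the cost
    # ratio; a simplified stand-in for the paper's cost-based decision system.
    risk_threshold = cost_early / (cost_early + cost_break)
    for h in sorted(models):
        p_fail = models[h].predict_proba([x])[0][1]
        if p_fail >= risk_threshold:
            return "schedule maintenance within %d cycles (risk %.2f)" % (h, p_fail)
    return "no maintenance needed yet"

print(maintenance_decision(rng.normal(size=5)))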


Location-based sequential event prediction is an interesting problem with many real-world applications. For

example, knowing when and where people will use certain kinds of services could enable the development of

robust anticipatory systems. A key to this problem is in understanding the nature of the process from which

sequential data arises. Usually, human behavior exhibits distinct spatial, temporal, and social patterns. The

authors examine three kinds of patterns extracted from sequential purchasing events and propose a novel

model that captures contextual dependencies in spatial sequence, customers' temporal preferences, and social

influence via an implicit network. Their model outperforms existing models based on evaluations using a real-

world dataset of smartcard transaction records from a large educational institution with 13,753 students during

a 10-month time period.

ETPL

DM - 050

Predicting Location-Based Sequential Purchasing Events by Using Spatial, Temporal,

and Social Patterns

There is an unprecedented trend that content providers (CPs) are building their own content delivery networks

(CDNs) to provide a variety of content services to their users. By exploiting powerful CP-level information in

content distribution, these CP-built CDNs open up a whole new design space and are changing the content

delivery landscape. In this paper, we adopt a measurement-based approach to understanding why, how, and

how much CP-level intelligences can help content delivery. We first present a measurement study of the CDN

built by Tencent, one of the largest content providers in China. We observe new characteristics and trends in

content delivery which pose great challenges to the conventional content delivery paradigm and motivate the

proposal of CPCDN, a CDN powered by CP-aware information. We then reveal the benefits obtained by

exploiting two indispensable CP-level intelligences, namely context intelligence and user intelligence, in

content delivery. Inspired by the insights learnt from the measurement studies, we systematically explore the

design space of CPCDN and present the novel architecture and algorithms to address the new content delivery

challenges that have arisen. Our results not only demonstrate the potential of CPCDN in pushing content

delivery performance to the next level, but also identify new research problems calling for further

investigation.

ETPL

DM - 051

CPCDN: Content Delivery Powered by Context and User Intelligence


In this paper, a new method for constructing decision trees for stream data is proposed. First, a new splitting

criterion based on the misclassification error is derived. A theorem is proven showing that the best attribute

computed in the considered node according to the available data sample is, with high probability, the same

as the attribute derived from the whole infinite data stream. Next, this result is combined with the splitting

criterion based on the Gini index. It is shown that such a combination provides the highest accuracy among all

studied algorithms.

ETPL

DM - 052

A New Method for Data Stream Mining Based on the Misclassification Error
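
For reference, the two impurity measures combined in this method, the misclassification error and the Gini index, can be compared on a single candidate split with a few lines of code; the tiny label array below is hypothetical.

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def misclassification_error(labels):
    _, counts = np.unique(labels, return_counts=True)
    return 1.0 - counts.max() / counts.sum()

def split_gain(parent, left, right, impurity):
    # Impurity reduction achieved by splitting `parent` into `left` and `right`.
    w = len(left) / len(parent)
    return impurity(parent) - (w * impurity(left) + (1 - w) * impurity(right))

parent = np.array([0, 0, 0, 1, 1, 1, 1, 1])
left, right = parent[:4], parent[4:]
print("Gini gain:", round(split_gain(parent, left, right, gini), 3))
print("Misclassification gain:", round(split_gain(parent, left, right, misclassification_error), 3))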

This paper first introduces pattern-aided regression (PXR) models, a new type of regression model designed to

represent accurate and interpretable prediction models. This was motivated by two observations: (1)

Regression modeling applications often involve complex diverse predictor-response relationships, which occur

when the optimal regression models (of a given regression model type) fitting two or more distinct logical

groups of data are highly different. (2) State-of-the-art regression methods are often unable to adequately

model such relationships. This paper defines PXR models using several patterns and local regression models,

which respectively serve as logical and behavioral characterizations of distinct predictor-response

relationships. The paper also introduces a contrast pattern aided regression (CPXR) method, to build accurate

PXR models. In experiments, the PXR models built by CPXR are very accurate in general, often

outperforming state-of-the-art regression methods by wide margins. Usually using (a) around seven simple

patterns and (b) linear local regression models, those PXR models are easy to interpret; in fact, their

complexity is just a bit higher than that of (piecewise) linear regression models and is significantly lower than

that of traditional ensemble-based regression models. CPXR is especially effective for high-dimensional data.

The paper also discusses how to use CPXR methodology for analyzing prediction models and correcting their

prediction errors.

ETPL

DM - 053

Pattern-Aided Regression Modeling and Prediction Model Analysis
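
A minimal sketch of the pattern-aided idea: each pattern (a boolean condition over the predictors) gets its own local linear model fitted on the matching records, and prediction dispatches to the first matching pattern, falling back to a global model. The two hand-written patterns and the synthetic data are assumptions for illustration; CPXR discovers contrast patterns automatically.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 600
X = rng.uniform(-2, 2, size=(n, 2))
# Two logical groups with different predictor-response relationships.
y = np.where(X[:, 0] > 0, 3 * X[:, 1] + 1, -2 * X[:, 1] + 5) + rng.normal(0, 0.2, n)

# Hypothetical "patterns" (boolean conditions over the predictors) and their local models.
patterns = [lambda X: X[:, 0] > 0, lambda X: X[:, 0] <= 0]
local_models = [LinearRegression().fit(X[p(X)], y[p(X)]) for p in patterns]
default_model = LinearRegression().fit(X, y)   # used when no pattern matches

def predict(x):
    x = np.asarray(x).reshape(1, -1)
    for p, m in zip(patterns, local_models):
        if p(x)[0]:
            return m.predict(x)[0]
    return default_model.predict(x)[0]

print(round(predict([1.0, 0.5]), 2), round(predict([-1.0, 0.5]), 2))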


In collaborative environments, members may try to acquire similar information on the Web in order to gain

knowledge in one domain. For example, in a company several departments may successively need to buy

business intelligence software and employees from these departments may have studied online about different

business intelligence tools and their features independently. It would be productive to connect them and let them

share the knowledge they have learned. We investigate fine-grained knowledge sharing in collaborative environments. We

propose to analyze members’ Web surfing data to summarize the fine-grained knowledge acquired by them. A

two-step framework is proposed for mining fine-grained knowledge: (1) Web surfing data is clustered into

tasks by a nonparametric generative model; (2) a novel discriminative infinite Hidden Markov Model is

developed to mine fine-grained aspects in each task. Finally, the classic expert search method is applied to the

mined results to find proper members for knowledge sharing. Experiments on Web surfing data collected from

our lab at UCSB and IBM show that the fine-grained aspect mining framework works as expected and

outperforms baselines. When it is integrated with expert search, the search accuracy improves significantly, in

comparison with applying the classic expert search method directly on Web surfing data.

ETPL

DM - 054

Fine-Grained Knowledge Sharing in Collaborative Environments

High utility sequential pattern mining has been considered an important research problem, and a number of

relevant algorithms have been proposed for this topic. The main challenge of high utility sequential pattern

mining is that the search space is large and the efficiency of the solutions is directly affected by the degree to

which they can eliminate the candidate patterns. Therefore, the efficiency of any high utility sequential pattern

mining solution depends on its ability to reduce this large search space and, as a result, lower the computational

complexity of calculating the utilities of the candidate patterns. In this paper, we propose efficient data

structures and a pruning technique based on a Cumulated Rest of Match (CRoM) upper bound.

CRoM, by defining a tighter upper bound on the utility of the candidates, allows more conservative pruning

before candidate pattern generation in comparison to the existing techniques. In addition, we have developed

an efficient algorithm, HuspExt (High Utility Sequential Pattern Extraction), which calculates the utilities of

the child patterns based on those of their parents. Substantial experiments on both synthetic and real datasets

from different domains show that the proposed solution efficiently discovers high utility sequential patterns

from large scale datasets with different data characteristics, under low utility thresholds.

ETPL

DM - 055

CRoM and HuspExt: Improving Efficiency of High Utility Sequential Pattern

Extraction
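
The pruning principle can be illustrated with a generic remaining-utility upper bound: for a candidate prefix, sum its matched utility plus the utility of everything that could still extend it, and stop growing the prefix if that bound already falls below the threshold. The function below is a simplified stand-in for the paper's tighter CRoM bound; the per-item utilities and sequence database are hypothetical.

def remaining_utility_bound(prefix, sequences, utilities):
    # For every sequence containing the prefix (greedy leftmost subsequence match),
    # count the prefix utility plus the utility of all items after the last matched
    # position; the sum over sequences upper-bounds the utility of any extension.
    bound = 0
    for seq in sequences:
        pos, i, matched_utility = -1, 0, 0
        for j, item in enumerate(seq):
            if i < len(prefix) and item == prefix[i]:
                matched_utility += utilities[item]
                pos, i = j, i + 1
        if i == len(prefix):  # prefix occurs in this sequence
            rest = sum(utilities[item] for item in seq[pos + 1:])
            bound += matched_utility + rest
    return bound

# Hypothetical per-item utilities and sequence database.
utilities = {"a": 5, "b": 3, "c": 10, "d": 1}
db = [["a", "b", "c", "d"], ["a", "c", "d"], ["b", "d"]]
min_utility = 30
ub = remaining_utility_bound(["a", "b"], db, utilities)
print("upper bound:", ub, "-> prune" if ub < min_utility else "-> keep extending")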


This paper addresses the problem of keyword extraction from conversations, with the goal of using these

keywords to retrieve, for each short conversation fragment, a small number of potentially relevant documents,

which can be recommended to participants. However, even a short fragment contains a variety of words,

which are potentially related to several topics; moreover, using an automatic speech recognition (ASR) system

introduces errors among them. Therefore, it is difficult to infer precisely the information needs of the

conversation participants. We first propose an algorithm to extract keywords from the output of an ASR

system (or a manual transcript for testing), which makes use of topic modeling techniques and of a submodular

reward function which favors diversity in the keyword set, to match the potential diversity of topics and reduce

ASR noise. Then, we propose a method to derive multiple topically separated queries from this keyword set, in

order to maximize the chances of making at least one relevant recommendation when using these queries to

search over the English Wikipedia. The proposed methods are evaluated in terms of relevance with respect to

conversation fragments from the Fisher, AMI, and ELEA conversational corpora, rated by several human

judges. The scores show that our proposal improves over previous methods that consider only word frequency

or topic similarity, and represents a promising solution for a document recommender system to be used in

conversations.

ETPL

DM - 056

Keyword Extraction and Clustering for Document Recommendation in

Conversations
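
The diversity-favoring keyword selection can be sketched as greedy maximization of a submodular reward, for example the sum over topics of the square root of accumulated topic weight, which gives diminishing returns and therefore spreads the chosen keywords across topics. The reward and the toy word-topic weights below are illustrative assumptions, not the paper's exact formulation.

import math

def greedy_diverse_keywords(word_topic_weights, k=5):
    # Greedily pick k keywords maximizing a submodular coverage reward.
    covered, chosen = {}, []
    def reward(cov):
        return sum(math.sqrt(w) for w in cov.values())
    for _ in range(k):
        base = reward(covered)
        best_word, best_gain = None, 0.0
        for word, topics in word_topic_weights.items():
            if word in chosen:
                continue
            cov = dict(covered)
            for t, w in topics.items():
                cov[t] = cov.get(t, 0.0) + w
            gain = reward(cov) - base
            if gain > best_gain:
                best_word, best_gain = word, gain
        if best_word is None:
            break
        chosen.append(best_word)
        for t, w in word_topic_weights[best_word].items():
            covered[t] = covered.get(t, 0.0) + w
    return chosen

# Hypothetical topic weights (e.g., from a topic model) for candidate keywords.
weights = {"router": {"networking": 0.9}, "bandwidth": {"networking": 0.8},
           "diet": {"health": 0.7}, "protein": {"health": 0.6}, "stock": {"finance": 0.5}}
print(greedy_diverse_keywords(weights, k=3))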

Thank You!