THIC MedIX Summer 2015 Poster
Thresholded Hierarchical Itemset Clustering for Expert Explorations
Diana Zajac, Thomas Lux, Dr. Jacob Furst, Dr. Daniela Raicu
College of Computing and Digital Media, DePaul University
Summer 2015
Introduction Clustering Algorithms THIC
Datasets
Traditional Machine Learning (ML) techniques are able to
cluster datasets, yet the clusters they produce are difficult to
interpret. Noise, high dimensionality, and complexity in the
data can make clustering difficult and produce undesirable
results. In addition, most clustering algorithms produce clusters
without explaining what patterns were found among the data
points or which patterns the clusters were formed on. In an
attempt to solve the problem of clustering high-dimensional,
complex, and noisy datasets while producing interpretable
results, we created an interactive user interface called THIC.
THIC stands for Thresholded Hierarchical Itemset Clustering, a
name that describes the method by which it clusters data. What
makes THIC innovative is its ability to modify the clustering
algorithm with ‘expert’ feedback. An ‘expert’ is some outside
source of information that can provide intuitive guidance as to
which features the algorithm should cluster on.
Figure 1 is part of the 2012 City Livability dataset, obtained with permission of The Economist
Intelligence Unit (EIU) from their collaboration with BuzzData.
As another example, a well-traveled ‘expert’ could instruct
THIC to group the “most homey” cities under one cluster, the
“most beautiful” cities under another, and so on. THIC will
cluster the cities based on the expert’s guidance, predict which
clusters the cities the expert has not yet visited may fit into,
and then explain which city features are most important in
determining the clusters.
Other datasets we worked with included a large text
corpus, lung cancer data, and Chronic Fatigue Syndrome data.
K-Means:
K-Means clustering is an algorithm that forms k clusters based on the
distance of each data point from the cluster centers. It begins by plotting
each data point (in the case of City Livability, each city is a point) with
the features as dimensions: for n features there are n dimensions, so each
point has a coordinate (x1, x2, …, xn) given by its feature values. K-Means
chooses initial cluster centers and then iteratively moves them until the
distances from the points to their centers are minimal and the clusters are
separated as well as possible.
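The assign-and-move loop described above can be sketched in a few lines. This is a minimal illustration of standard k-means (Lloyd's algorithm) using only the Python standard library, not the implementation used for the poster; the function name `kmeans`, the toy `cities` data, and the random choice of initial centers are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: returns (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Assumption: initial centers are k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the centers have settled
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels

# Toy example: six "cities" described by two made-up feature scores.
cities = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.2), (7.9, 8.1), (8.3, 7.8)]
centers, labels = kmeans(cities, k=2)
```

With n features per case, each tuple simply has n entries; the distance computation is unchanged.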
K-Means with Feature Selection (KMFS):
KMFS uses feature selection algorithms to aid k-means clustering.
Feature selection is usually used to strip a dataset of irrelevant,
corrupted, or redundant features, thereby improving any analysis based on
the remaining features. KMFS selects features one by one, starting with
those that create the ‘best’ (most defined and separate) clusters, and
continues to add features until the clusters become ‘bad’ (overlapping and
spread out). Incorporating feature selection into k-means clustering allows
k-means to cluster the data and return to the user the most relevant features
used. KMFS gives the user an idea of what each cluster is based on (which
features ‘trend’ in each cluster), but it describes cluster features in terms
of probabilities rather than with 100% accuracy, and it provides no user control.
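A greedy forward selection of this kind might look like the sketch below. The poster does not say how ‘best’ and ‘bad’ clusters are scored, so this example assumes the mean silhouette coefficient as the quality measure and stops as soon as adding any remaining feature lowers it. The names `kmfs`, `silhouette`, `kmeans_labels`, and `project` are hypothetical.

```python
import math
import random

def kmeans_labels(rows, k, iterations=50, seed=0):
    """Cluster rows with plain k-means and return one label per row."""
    rng = random.Random(seed)
    centers = [list(r) for r in rng.sample(rows, k)]
    labels = [None] * len(rows)
    for _ in range(iterations):
        new = [min(range(k), key=lambda c: math.dist(r, centers[c])) for r in rows]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def silhouette(rows, labels):
    """Mean silhouette coefficient (assumed quality score; higher is better)."""
    scores = []
    for i, (r, lab) in enumerate(zip(rows, labels)):
        by_cluster = {}
        for j, (other, other_lab) in enumerate(zip(rows, labels)):
            if i != j:
                by_cluster.setdefault(other_lab, []).append(math.dist(r, other))
        if lab not in by_cluster or len(by_cluster) < 2:
            scores.append(0.0)  # singleton cluster or only one cluster present
            continue
        a = sum(by_cluster[lab]) / len(by_cluster[lab])
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != lab)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def project(data, features):
    """Keep only the chosen feature columns of every row."""
    return [tuple(row[f] for f in features) for row in data]

def kmfs(data, k, seed=0):
    """Add features one at a time while the clustering score does not drop."""
    n_features = len(data[0])
    selected, best_score = [], -1.0
    while len(selected) < n_features:
        candidates = []
        for f in range(n_features):
            if f not in selected:
                rows = project(data, selected + [f])
                labels = kmeans_labels(rows, k, seed=seed)
                candidates.append((silhouette(rows, labels), f))
        score, feature = max(candidates)
        if score < best_score:
            break  # every remaining feature makes the clusters worse
        selected.append(feature)
        best_score = score
    return selected, best_score

# Made-up data: feature 0 separates two groups, feature 1 is noise.
data = [(1.0, 5.0), (1.1, 1.0), (0.9, 9.0), (5.0, 1.2), (5.1, 8.8), (4.9, 5.1)]
selected, score = kmfs(data, k=2)
```

On this toy data the noisy second feature is rejected, so only feature 0 is reported back as relevant.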
Why THIC is better:
Expert-guided clustering
Better data interpretability
Many different possibilities (for results)
Provides a controllable tradeoff between optimal results and meaningful
results
Doesn’t lose data dimensionality (no important information lost in feature
selection)
THIC’s philosophy is focused on aiding a user in understanding and
exploring datasets, finding unseen patterns and correlations in datasets, and
creating unconventional clustering of data.
Group 1: High: Green Space, Sprawl
Group 2: Low: Sprawl, Culture and Environment, Infrastructure
Group 3: Low: Infrastructure; High: Green Space
Group 4: Low: Sprawl, Culture and Environment
Group 5: Low: Green Space, Sprawl
Group 6: Low: Green Space; High: Sprawl
Group 7: Low: Sprawl; High: Green Space
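One way to read group descriptions such as “Low: Sprawl; High: Green Space” is as sets of thresholded single items shared by every case in a group. The sketch below illustrates that reading only. The poster does not give THIC's thresholds or its rule for describing a group, so the cutoffs, the feature scores, and the helper names `to_items` and `describe_group` are all assumptions.

```python
def to_items(case, thresholds):
    """Turn numeric features into ('High', f) / ('Low', f) items.

    Assumption: each feature has a (low, high) cutoff pair; values between
    the two cutoffs produce no item for that feature.
    """
    items = set()
    for feature, value in case.items():
        low, high = thresholds[feature]
        if value <= low:
            items.add(("Low", feature))
        elif value >= high:
            items.add(("High", feature))
    return items

def describe_group(cases, thresholds):
    """Return the single items shared by every case in the group."""
    item_sets = [to_items(c, thresholds) for c in cases]
    return set.intersection(*item_sets) if item_sets else set()

# Hypothetical cutoffs and scores, for illustration only.
thresholds = {
    "Green Space": (2.0, 4.0),
    "Sprawl": (2.0, 4.0),
    "Infrastructure": (2.0, 4.0),
}
group = [
    {"Green Space": 4.5, "Sprawl": 1.0, "Infrastructure": 3.0},
    {"Green Space": 4.0, "Sprawl": 2.0, "Infrastructure": 1.5},
]
description = describe_group(group, thresholds)
```

Here both cases share high Green Space and low Sprawl, while Infrastructure is low in only one of them and so is left out of the description.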
The dataset below is one of the datasets we used in
testing THIC. This dataset is particularly interesting because of
the ‘expert feedback’ opportunity. For example, an expert may
want to cluster cities to answer “what do European cities have
in common?” The expert would instruct THIC to group
European cities under one cluster, and THIC would produce
results explaining which features all European cities have in
common.
THIC is an interactive interface that allows users to import a numerical
dataset and cluster the data based on their own preferences, such as:
Which features should be included/excluded
Which features should be given higher priority (more weight)
Sizes of groups
Making subgroups
Number of groups
Define groups using features
Control between optimal clustering and clustering meaningful to user
Acknowledgments
Dr. Jacob Furst, PhD, 1998, professor, DePaul University, CDM
Dr. Daniela Raicu, PhD, 2002, professor, DePaul University, CDM
College of Computing and Digital Media, DePaul University
Science Research Fellows
DePauw University
Future Work
Although we completed THIC’s preliminary phase, there is still much to
improve. The current THIC implementation focuses on single-item itemsets,
because increasing itemset size increases both the computation time and the
amount of overlap between groups. Another goal is to develop better ‘stopping
criteria’ for the algorithm, which at the moment are based on group overlap and
minimal coverage. With better stopping criteria, expanding to multi-item
itemsets would be more feasible without contradicting the philosophy of THIC.
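To make the two quantities behind the stopping criteria concrete, the sketch below computes coverage and overlap for a set of groups. The poster names the quantities but not their definitions or cutoffs, so everything here is an assumption: coverage is taken as the fraction of cases in at least one group, overlap as the fraction in more than one, and `should_stop` with its default limits is a hypothetical rule.

```python
def coverage(groups, n_cases):
    """Fraction of all cases that belong to at least one group."""
    covered = set().union(*groups)
    return len(covered) / n_cases

def overlap(groups, n_cases):
    """Fraction of all cases that belong to more than one group."""
    seen, repeated = set(), set()
    for group in groups:
        repeated |= seen & group  # cases already placed in an earlier group
        seen |= group
    return len(repeated) / n_cases

def should_stop(groups, n_cases, min_coverage=0.9, max_overlap=0.2):
    """Hypothetical rule: stop once enough cases are covered,
    or once the groups overlap too much."""
    return (coverage(groups, n_cases) >= min_coverage
            or overlap(groups, n_cases) > max_overlap)

# Ten cases identified by index; each group is a set of case indices.
groups = [{0, 1, 2, 3}, {3, 4, 5}, {6, 7}]
print(coverage(groups, 10), overlap(groups, 10), should_stop(groups, 10))
```

Under these assumed definitions, larger itemsets would tend to raise the overlap figure, which is the effect the paragraph above describes.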
When completed, THIC will be able to provide meaningful information in
multiple domains, including but not limited to economics, medical sciences, and
statistical analysis.
THIC produces diverse results depending on all of these preferences.
So, the focus of THIC isn’t necessarily the ‘best’ clusters/groupings, but
instead is more about producing results that can aid in understanding a data
set, such as:
Finding certain patterns that may not be evident without THIC (due to size
of dataset or complexity)
Producing results by defining ‘known’ clusters, and matching the rest of
the cases to those
Describing relationships between different features, as well as different
cases—in City Livability, cases are the cities, and features are qualities,
such as pollution and quality of education.
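As an illustration of defining ‘known’ clusters and matching the remaining cases to them, the sketch below assigns each unlabeled case to the expert-defined group whose mean feature vector is nearest. Nearest-centroid matching with Euclidean distance is an assumption made for the example, since the poster does not describe how THIC performs the matching; the group names, city names, and feature values are made up.

```python
import math

def centroid(rows):
    """Mean feature vector of a list of equally sized numeric tuples."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def assign_to_known(known, unlabeled):
    """Map each unlabeled case to the nearest expert-defined group.

    known:     {group name: [feature tuples of cases the expert labeled]}
    unlabeled: {case name: feature tuple}
    """
    centers = {name: centroid(rows) for name, rows in known.items()}
    return {
        case: min(centers, key=lambda name: math.dist(features, centers[name]))
        for case, features in unlabeled.items()
    }

# Hypothetical expert labels on two made-up features.
known = {
    "most homey": [(8.0, 2.0), (7.5, 2.5)],
    "most beautiful": [(2.0, 9.0), (3.0, 8.0)],
}
unlabeled = {"City A": (7.0, 3.0), "City B": (2.5, 7.5)}
assignments = assign_to_known(known, unlabeled)
```

The expert's labels are never overwritten in this sketch; only the cases the expert has not placed are predicted.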