Paper PDF: http://www.icmc.usp.br/pessoas/junio/PublishedPapers/RodriguesJr_et_al-IV2010.pdf
Institute of Mathematics and Computer Sciences, University of Sao Paulo (USP) - Brazil
London, UK - July/2010
Combining Visual Analytics and Content Based Data Retrieval Technology for Efficient Data Analysis
Jose F Rodrigues Jr – junio@icmc.usp.br
- 14th International Conference on Information Visualisation (IV 2010) -
Context and problems: content-based data retrieval and visualization
Context: Content-based data retrieval
Metric spaces: the de facto solution for non-orderable data domains
• Multivariate data
• Image
• Video
• Sound
• Text
Content-based data retrieval in metric spaces demands three things:
1. Feature extraction/annotation
2. A distance function
3. A metric structure
Context: Content-based data retrieval - Feature extraction/annotation
Feature extraction transforms data objects into multivariate data
Feature annotation translates real-world objects into multivariate data
Context: Content-based data retrieval - Distance functions
Distance functions measure the distance between feature vectors
• L1 = Manhattan
• L2 = Euclidean
• L0 = LInfinity = Chebyshev
[Figure: radius r under each distance function]
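The three distance functions named on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the sample vectors are assumptions made here, not part of the original slides.

```python
import math

def manhattan(u, v):
    """L1: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def euclidean(u, v):
    """L2: straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def chebyshev(u, v):
    """LInfinity: largest absolute coordinate difference."""
    return max(abs(a - b) for a, b in zip(u, v))

# Two hypothetical feature vectors.
u, v = (0.0, 0.0), (3.0, 4.0)
print(manhattan(u, v), euclidean(u, v), chebyshev(u, v))  # 7.0 5.0 4.0
```

For the same pair of vectors the three functions give different values, which is why the choice of distance function changes what "similar" means in a query.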
Context: Content-based data retrieval
Features/Multivariate Data
Distance Function
Similarity Queries
Context: Metric spaces
Content-based data retrieval: Features Extraction, Distance Function, Metric Structure
Data retrieval
Context: Metric spaces
Problem 1 - it is difficult to make sense of metric spaces:
• High dimensionality
• Too many data items
• Similarity queries work like a black box: there is no opportunity to check the context of what was retrieved, nor what else could have been retrieved
Context: Visualization
The goal of this conference
Important aid for data analysis
Interactive
Intuitive
Context: Visualization
Visualization mantra: overview first, zoom & filter, then details on demand
Hence: filtering is an essential part of visualization
The principal means for visual filtering is interactive brushing
Context: Visualization
Problem 2 - interactive brushing may be inefficient:
• Overlap of graphical elements
• Too many data elements to select one at a time
• Selection based on geometrical primitives (such as spheres and cubes) will brush more elements than desired
Context: Visualization
Problem 3 - the user’s interests may yield filterings that cannot be accomplished with brushing:
Semantic interests lead to data items that are:
• Not equally important
• Not visually adjacent
• Not in the same analytical context
Methodology: why not put together content-based data retrieval and visualization?
Linking CBDR and visualization
Metric spaces are inherently spatial
Visualizations use space as one of their main channels for data encoding
CBDR: Slim-Tree - Family of Metric Access Methods (Traina Jr. et al. 2000)
Visualization: Multidimensional Projection with the FastMap algorithm (Faloutsos and K.-I. Lin 1995)
Methodology: content-based data retrieval via Slim-Tree
Slim-Tree Metric Access Method
• Goal: performance in similarity queries
• Challenge: multidimensional data, and queries based on similarity rather than equality
• Demands: a distance function satisfying the triangle inequality, symmetry, and non-negativity
• Principle: a hierarchy (tree) of progressively restricting radii around representative objects
• Engineering: clustering of subtrees and use of the triangle inequality for branch pruning during search
• Mechanism: dynamic, balanced, bottom-up, disk-oriented
• Data: representatives at inner nodes, objects at the leaves
Video: Metric-Tree-Example-faster.avi
Slim-Tree Metric Access Method
• Example: Range Query
Query object Oq and a query radius rq
[Figure: Slim-Tree searched with Oq; inner nodes hold the representatives repr0(Op), repr1(Op), repr1(Oi), repr1(Oj); leaves hold obj(Oi), obj(Oj), obj(Ok), obj(Ol), obj(Om), obj(Op)]
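A range query over a tree of representatives and covering radii can be sketched as below. This is not the Slim-Tree implementation: the `Node` class, its fields, and the sample data are hypothetical, and only the pruning rule (a subtree is skipped when the query ball cannot intersect its covering ball) follows the idea described on these slides.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

class Node:
    """Hypothetical node: a representative, a covering radius,
    and either child nodes (inner node) or stored objects (leaf)."""
    def __init__(self, rep, radius, children=(), objects=()):
        self.rep = rep
        self.radius = radius
        self.children = list(children)
        self.objects = list(objects)

def range_query(node, oq, rq, dist=euclidean):
    # Every object o below this node satisfies dist(rep, o) <= radius, so by the
    # triangle inequality dist(oq, o) >= dist(oq, rep) - radius.
    # If that lower bound already exceeds rq, the whole subtree is pruned.
    if dist(oq, node.rep) > rq + node.radius:
        return []
    hits = [o for o in node.objects if dist(oq, o) <= rq]
    for child in node.children:
        hits.extend(range_query(child, oq, rq, dist))
    return hits

left = Node(rep=(1.0, 1.0), radius=1.5,
            objects=[(1.0, 1.0), (2.0, 2.0), (0.0, 1.0)])
right = Node(rep=(9.0, 9.0), radius=1.5,
             objects=[(9.0, 9.0), (10.0, 9.0), (8.0, 8.0)])
root = Node(rep=(1.0, 1.0), radius=13.0, children=[left, right])

# The right subtree is pruned without touching any of its objects.
print(range_query(root, oq=(1.5, 1.5), rq=1.0))  # [(1.0, 1.0), (2.0, 2.0)]
```

The pruning test needs only one distance computation per node, which is where the performance gain over a sequential scan comes from.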
Methodology: visualization via FastMap
FastMap
Projection of p-dimensional points onto k mutually orthogonal directions; p > k
• Imagine that the points are in a Euclidean space
• Select two pivot points xa and xb that are far apart
• Compute a pseudo-projection of the remaining points along the “line” xaxb
• “Project” the points onto an orthogonal subspace and recurse
FastMap: Selecting the Pivot Points
• Finding the two furthest points exactly costs O(n²)
• Use a heuristic instead: 2*(n-1) distance computations, O(n)
  1. Select any point x0
  2. Let x1 be the furthest from x0
  3. Let x2 be the furthest from x1
  4. Return (x1, x2) as the pivots (xa, xb)
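The pivot heuristic listed on this slide can be sketched as follows. The function name `choose_pivots`, the `start` parameter, and the sample points are assumptions made for illustration; the two linear scans correspond to the 2*(n-1) distance computations mentioned on the slide.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def choose_pivots(points, dist=euclidean, start=0):
    """Two linear scans instead of comparing all pairs."""
    x0 = points[start]                               # any point
    x1 = max(points, key=lambda p: dist(x0, p))      # furthest from x0
    x2 = max(points, key=lambda p: dist(x1, p))      # furthest from x1
    return x1, x2

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (-4.0, 0.0), (2.0, 1.0)]
print(choose_pivots(points))  # ((5.0, 0.0), (-4.0, 0.0))
```

The result is not guaranteed to be the furthest pair in general; it is a cheap approximation that tends to pick points that are far apart.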
FastMap: Pseudo-Projections
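Given pivots (xa, xb), and any third point y
From the law of cosines, one can get to:
$$ d_{b,y}^2 = d_{a,y}^2 + d_{a,b}^2 - 2\, c_y\, d_{a,b} $$
The pseudo-projection for y is
$$ c_y = \frac{d_{a,y}^2 + d_{a,b}^2 - d_{b,y}^2}{2\, d_{a,b}} $$
This is the first of k coordinates to compute
[Figure: triangle xa, xb, y with sides d_{a,y}, d_{b,y}, d_{a,b}, and c_y the projection of y onto the line xaxb]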
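The step from the law of cosines to the pseudo-projection can be spelled out as below. The angle symbol θ (the angle at xa between the segments xa-xb and xa-y) and the numeric example are introduced here for illustration only.

```latex
% Law of cosines in the triangle (x_a, x_b, y), with \theta the angle at x_a:
d_{b,y}^2 = d_{a,y}^2 + d_{a,b}^2 - 2\, d_{a,y}\, d_{a,b} \cos\theta
% The projection of y onto the line x_a x_b, measured from x_a, is
c_y = d_{a,y} \cos\theta
% Substituting and solving for c_y:
d_{b,y}^2 = d_{a,y}^2 + d_{a,b}^2 - 2\, c_y\, d_{a,b}
\quad\Longrightarrow\quad
c_y = \frac{d_{a,y}^2 + d_{a,b}^2 - d_{b,y}^2}{2\, d_{a,b}}
% Example with d_{a,y} = 3,\ d_{b,y} = 4,\ d_{a,b} = 5:
c_y = \frac{9 + 25 - 16}{2 \cdot 5} = 1.8
```

Only pairwise distances appear in the final expression, so the coordinate can be computed without knowing the original coordinates of the points.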
FastMap: “Project to orthogonal plane”
For orthogonality, the points are considered as if projected onto a hyperplane perpendicular to xaxb; hence p becomes p-1
The projection is not actually carried out; instead, the distance function for that configuration is determined
We can compute distances within the “orthogonal hyperplane” using the Pythagorean theorem
$$ d'(y', z') = \sqrt{d(y, z)^2 - (c_z - c_y)^2} $$
[Figure: points y and z, their projections y’ and z’ onto the hyperplane perpendicular to xaxb, and the distances d_{y,z}, d_{y’,z’} and c_z - c_y]
Go back to the previous slide and use d’ to calculate the next coordinate; repeat until k coordinates are computed
More important than what this step does is how it does it: this step is the main piece of engineering behind FastMap, because it lets the algorithm work recursively and fast.
• Good point: overall complexity is O(k*n), that is, O(n) for a fixed k
• Bad point: original distances are not strictly preserved, which introduces stress (as with any other technique of its kind)
• All in all, FastMap ranks among the top five multidimensional projection techniques (along with MDS, PCA, LSP, FDP), and has the best performance of them
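Putting the pivot selection, the pseudo-projection, and the residual distance together, a compact FastMap sketch might look like the following. It is an illustration under assumptions, not the implementation used in MetricSPlat: the function names, the choice of index 0 as the arbitrary starting point, and the clamping of negative residuals to zero are all choices made here.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def fastmap(points, k, dist=euclidean):
    """Return k FastMap coordinates for each point."""
    n = len(points)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        # Squared distance inside the hyperplane orthogonal to the first `col`
        # axes: original squared distance minus the squared differences of the
        # coordinates computed so far (Pythagorean theorem).
        value = dist(points[i], points[j]) ** 2
        for c in range(col):
            value -= (coords[i][c] - coords[j][c]) ** 2
        return max(value, 0.0)  # clamp rounding noise (assumption of this sketch)

    for col in range(k):
        # Pivot heuristic: furthest from an arbitrary point, then furthest from that.
        a = max(range(n), key=lambda i: d2(0, i, col))
        b = max(range(n), key=lambda i: d2(a, i, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # no spread left; remaining coordinates stay at 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Pseudo-projection from the law of cosines.
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2.0 * dab)
    return coords

pts = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0), (1.0, 1.0)]
for original, projected in zip(pts, fastmap(pts, 2)):
    print(original, [round(c, 3) for c in projected])
```

Each of the k passes touches every point a constant number of times, which matches the k*n cost stated above. In this small example the input is already two-dimensional and Euclidean, so projecting to k = 2 preserves the pairwise distances; with k smaller than the intrinsic dimensionality some stress is expected.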
Solving the stated problems: innovating with MetricSPlat
Demonstration dataset
Breast Cancer Wisconsin (Diagnostic) Data Set
• 9 descriptive attributes annotated from tissue biopsies:
1. ClumpThickness
2. UniforSize
3. UniforShape
4. MargAdhes
5. SingleEpithSize
6. BareNuclei
7. BlandChromatin
8. NormalNucleoli
9. Mitoses
• Breast cancer classification:
  • benign (class 0)
  • malignant (class 1)
Problem 1 - it is difficult to make sense of metric spaces
• Choose a distance function, which will be the same for FastMap and Slim-Tree
• Build a Slim-Tree
• Choose an element of interest to be the center of the projection
• Retrieve the entire dataset out of the Slim-Tree
• FastMap-project all the elements of the dataset in 3D
• Use the central point for translation
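One plausible reading of the last step, using the central point for translation, is to shift the projected coordinates so that the element of interest lands at the origin of the 3D view. The sketch below assumes that reading; the function name `center_on` and the sample coordinates are hypothetical.

```python
def center_on(coords, center_index):
    """Translate projected points so that the chosen element sits at the origin."""
    center = coords[center_index]
    return [tuple(x - c for x, c in zip(point, center)) for point in coords]

# Hypothetical 3D coordinates, as they might come out of a projection step.
projected = [(1.0, 2.0, 3.0), (4.0, 6.0, 8.0), (0.0, 0.0, 1.0)]
print(center_on(projected, 0))
# [(0.0, 0.0, 0.0), (3.0, 4.0, 5.0), (-1.0, -2.0, -2.0)]
```

A translation changes no pairwise distance, so the projected layout is unaffected; only the point around which the view is organized changes.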
Video: Problem1-final.avi
Problem 2 - interactive brushing may be inefficient
• Data selections may demand too many brushing steps
• Attribute-by-attribute selection may not correspond to similarity as expected by the user
• In a 3D projection, spheres and cubes do not fit irregular regions of interest
• Due to stress, the projection does not exactly correspond to the original data configuration
Approach:
• Brush one element of interest
• Perform a KNN/Range query having this element as the query center
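The approach can be illustrated with brute-force versions of the two queries, run in the original feature space rather than on the projected points. In the system described here the queries would be answered through the Slim-Tree; the linear scans, the function names, the sample data, and the choice to leave the query center out of the answer are all assumptions of this sketch.

```python
import heapq
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def knn_query(data, center_id, k, dist=euclidean):
    """Ids of the k elements closest to the brushed element (center excluded)."""
    center = data[center_id]
    candidates = (i for i in data if i != center_id)
    return heapq.nsmallest(k, candidates, key=lambda i: dist(center, data[i]))

def range_query(data, center_id, radius, dist=euclidean):
    """Ids of all elements within `radius` of the brushed element (center excluded)."""
    center = data[center_id]
    return [i for i in data if i != center_id and dist(center, data[i]) <= radius]

# Hypothetical dataset: element id -> feature vector.
data = {1: (1.0, 1.0), 2: (1.0, 2.0), 3: (5.0, 5.0), 4: (2.0, 1.0), 5: (9.0, 9.0)}
print(knn_query(data, center_id=1, k=2))         # [2, 4]
print(range_query(data, center_id=1, radius=1.5))  # [2, 4]
```

A single brushed element thus selects a whole neighborhood defined by similarity, independent of how the points overlap on screen.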
Video: Problem2-final.avi
Problem 3 - the user’s interests may yield filterings that cannot be accomplished with brushing
• Weight attributes according to perception and domain knowledge
• Redefine the metric space and the projection
• Propagate data to multivariate visualization techniques
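Attribute weighting can be sketched as a weighted Euclidean distance, one common way to redefine a metric space. Whether MetricSPlat uses exactly this form is not stated on the slides, so the function below and its sample values are assumptions.

```python
import math

def weighted_euclidean(weights):
    """Build a distance function in which attribute i counts weights[i] times as much.
    Weights are assumed to be positive so that the result is still a metric."""
    def dist(u, v):
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)))
    return dist

center = (0.0, 0.0)
a, b = (3.0, 0.0), (0.0, 2.0)

plain = weighted_euclidean((1.0, 1.0))
emphasize_second = weighted_euclidean((1.0, 4.0))

print(plain(center, a), plain(center, b))                        # 3.0 2.0
print(emphasize_second(center, a), emphasize_second(center, b))  # 3.0 4.0
```

Without weights, b is the nearer neighbor of the center; once the second attribute is emphasized, a becomes nearer. The same change of distance function affects both the similarity queries and the projection built on top of it.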
Video: Problem3-final.avi
Without weighting of attributes 2 and 3:
KNN(15, 441) = <445, 444, 449, 433, 450, 448, 447, 446, 443, 442, 440, 439, 438, 437, 436> = R
After weighting of attributes 2 and 3:
KNN(15, 441) = <456, 454, 453, 452, 450, 449, 448, 447, 443, 441, 436, 433, 430, 425, 424> = R’
The query answer is 60% different:
R’ - R = {456, 454, 453, 452, 450, 441, 430, 425, 424}
R ∩ R’ = {449, 448, 447, 443, 436, 433}
After weighting, the query retrieves elements whose attributes 2 and 3 are more similar to those of the query center, element 441.
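The comparison of the two answers can be reproduced with plain set operations on the lists exactly as printed on the slide. Note that, with those lists, element 450 appears in both R and R’, so the sketch reports 8 changed elements (about 53%) and 7 common ones, one off from the 9 and 6 stated above; the slides do not allow telling whether the lists or the stated sets contain the typo.

```python
R = [445, 444, 449, 433, 450, 448, 447, 446, 443, 442, 440, 439, 438, 437, 436]
R_prime = [456, 454, 453, 452, 450, 449, 448, 447, 443, 441, 436, 433, 430, 425, 424]

only_in_R_prime = sorted(set(R_prime) - set(R), reverse=True)
common = sorted(set(R) & set(R_prime), reverse=True)
changed_percent = round(100 * len(only_in_R_prime) / len(R_prime))

print(only_in_R_prime)  # [456, 454, 453, 452, 441, 430, 425, 424]
print(common)           # [450, 449, 448, 447, 443, 436, 433]
print(changed_percent)  # 53
```

Either way, roughly half of the answer changes once attributes 2 and 3 are weighted, which is the point the slide makes.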
References
C. Faloutsos and K.-I. Lin, “FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets”, Proc. ACM SIGMOD, 1995, 163-174
C. Traina Jr., A. Traina, B. Seeger, C. Faloutsos, “Slim-trees: High Performance Metric Trees Minimizing Overlap Between Nodes”, Int. Conference on Extending Database Technology (EDBT), 2000, 51-65
Resources
D. Mount, “Feature Selection”, Presentation for CMSC 828K - University of Maryland - http://www.kanungo.com/teaching/cmsc828K/dave/feature.ppt - accessed 07/2010
Tomáš Skopal, Jaroslav Pokorný, Michal Krátký, Václav Snášel, “Revisiting M-tree Building Principles”, Presentation for Advances in Databases and Information Systems 2003 - http://siret.ms.mff.cuni.cz/skopal/pres/mtree.ppt - accessed 07/2010
Clemens Marschner, “MTree Tester Applet” - http://www.cmarschner.net/mtree.html - accessed 07/2010
Texas A&M University, “Function Plotter” - http://www.math.tamu.edu/AppliedCalc/Classes/Plot2/CardPlot.html
Thank you
http://www.icmc.usp.br/~junio (link “MetricSPlat”)
junio@icmc.usp.br