Data Mining and Data Warehousing Many-to-Many Relationships
Applications
William Perrizo, Dept. of Computer Science, North Dakota State Univ.
Why Mining Data?
Parkinson’s Law of Data
Data expands to fill available storage (and then some)
Disk-storage version of Moore’s law
Capacity ∝ 2^(t / 9 months)
Available storage doubles every 9 months!
Another More’s Law: More is Less
The more volume, the less information. (AKA: Shannon’s Canon)
A simple illustration: Which phone book is more helpful?
BOOK-1              BOOK-2
Name  Number        Name  Number
Smith 234-9816      Smith 234-9816
Jones 231-7237      Smith 231-7237
                    Jones 234-9816
                    Jones 231-7237
Awash with data! US EROS Data Center archives Earth Observing System (EOS)
remotely sensed images (RSI), satellite and aerial photo data for the Government (10 petabytes by 2005).
National Virtual Observatory (aggregated astronomical data) will exceed that by many orders of magnitude.
Sensor networks will collect unheard-of data volumes (especially Nano-sensor networks).
WWW will continue to grow (and other text collections too)
Micro-arrays, gene-chips and genome sequencing are creating potentially life-saving data at a torrid pace.
Useful information must be teased out of these large volumes of data. That’s data mining.
EOS Data Mining example
TIFF image Yield Map
This dataset is a 320 row and 320 column (102,400 pixels) spatial file with 5 feature attributes (B,G,R,NIR,Y). The (B,G,R,NIR) features are in the TIFF image and the Y (crop yield) feature is color coded in the Yield Map (blue=low; red=high)
What is the relationship between the color intensities and yield? We can hypothesize: hi_green and low_red → hi_yield, which, while not a simple SQL query result, is not surprising. Data Mining is more than just confirming hypotheses.
The stronger rule, hi_NIR and low_red → hi_yield, is not an SQL result and is surprising. Data Mining includes suggesting new hypotheses.
Another Precision Agriculture Example Grasshopper Infestation Prediction
• Grasshoppers cause significant economic loss each year.
• Early infestation prediction is key to damage control.
Association rule mining on remotely sensed imagery holds significant promise to achieve early detection.
Can initial infestation be determined from RGB bands???
Gene Regulation Pathway Discovery
Results of clustering may indicate, for instance, that nine genes are involved in a pathway. High-confidence rule mining on that cluster may discover the relationships among the genes, in which the expression of one gene (e.g., Gene2) is regulated by others. Other genes (e.g., Gene4 and Gene7) may not be directly involved in regulating Gene2 and can therefore be excluded (more later).
[Diagram: Clustering groups Gene1 through Gene9 into one cluster; ARM then uncovers the regulation structure among them.]
Sensor Network Data Mining
Micro-, even nano-sensor blocks are being developed for sensing:
bio agents, chemical agents, movements, coatings deterioration, etc.
There will be millions, even billions of individual sensors creating mountains of data.
The data must be mined for its meaning. Other data that requires mining includes:
shopping market basket analysis (Walmart), keywords in text (e.g., WWW), properties of proteins, stock market prediction, etc.
Data Mining?
Querying asks specific questions and expects specific answers.
Data Mining goes into the MOUNTAIN of DATA,
and returns with information gems (rules?)
But also, some fool’s gold?
Relevance and interestingness analysis serves as an assay (helps pick out the valuable information gems).
Data Mining versus Querying
There is a whole spectrum of techniques to get information from data:
Much work is yet to be done in optimizing query processing (D. DeWitt, ACM SIGMOD'02).
On the Data Mining end, the surface has barely been scratched.
But even those initial scratches have had a great impact, e.g., the difference between becoming the biggest corporation in the world (Walmart) and filing for bankruptcy (Kmart).
The spectrum (from standard querying, through searching and aggregating, to machine learning / data mining):
SQL (SELECT-FROM-WHERE)
Complex queries (nested, EXISTS, ...)
FUZZY queries, search engines, BLAST searches
OLAP (rollup, drilldown, slice/dice, ...)
Supervised learning: classification, regression
Unsupervised learning: clustering
Association Rule Mining, data prospecting, fractals, ...
Data Mining
Data mining: the core of the knowledge discovery process.
Raw Data
→ Data Cleaning/Integration (missing data, outliers, noise, errors)
→ Data Warehouse (cleaned, integrated, read-only, periodic, historical raw database)
→ Selection (feature extraction, tuple selection)
→ Task-relevant Data
→ Data Mining (OLAP, Classification, Clustering, ARM)
→ Pattern Evaluation
Our Approach
A new compressed, data-mining-ready data structure, the Peano-tree (Ptree)1, which processes vertical data horizontally (whereas standard RDBMSs process horizontal data vertically).
Ptrees facilitate data mining and address the curses of scalability and dimensionality.
A new compressed, OLAP-ready data warehousing structure, the Peano Data Cube (PDcube), facilitates OLAP operations and query processing. Fast logical operations on Ptrees are used.
1 Technology is patent pending by North Dakota State University.
A table, R(A1..An), is a horizontal structure (a set of horizontal records) processed vertically (vertical scans).
A Ptree is a vertical structure processed horizontally (ANDs).
Ptrees: fully vertically partition the table, then compress each bit file into a basic Ptree, then horizontally process these Ptrees using a multi-operand logical AND.
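A minimal Python sketch of the vertical partitioning step (illustrative only; the function name and string representation are my own, not NDSU's implementation): each attribute column is split into one bit file per bit position.

```python
def bit_slices(values, width):
    """Vertically partition a column of unsigned ints into bit files:
    one string per bit position, most significant bit first."""
    return ["".join(str((v >> (width - 1 - i)) & 1) for v in values)
            for i in range(width)]

# the A1 column of the example relation R (3-bit values)
a1 = [0b010, 0b011, 0b010, 0b010, 0b101, 0b010, 0b111, 0b111]
r11, r12, r13 = bit_slices(a1, 3)  # bit files for positions R11, R12, R13
```

Each resulting bit file is then compressed into its own basic Ptree.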
R(A1 A2 A3 A4) → vertical projections R[A1] R[A2] R[A3] R[A4] → bit files R11..R43 (one per bit position):

A1  A2  A3  A4
010 111 110 001
011 111 110 000
010 110 101 001
010 111 101 111
101 010 001 100
010 010 001 101
111 000 101 100
111 000 001 100

Horizontal structure, processed vertically (scans); vertically, each bit column Rij becomes its own bit file.
[Figure: the twelve basic Ptrees P11..P43, one per bit file R11..R43, each formed by recursively halving its bit file and recording 1 at a node iff that half is purely 1-bits.]
1-D Pure1 Ptrees are formed by recursively halving the bit vector and recording 1 at a node iff that half is purely 1-bits
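This halving rule can be sketched in a few lines of Python (a toy illustration; the nested-tuple node layout is my assumption, not the patented structure):

```python
def p1_tree(bits):
    """1-D Pure1 Ptree sketch: a pure-1 half becomes the leaf 1, a
    pure-0 half the leaf 0, and a mixed half a 0-node whose two
    children are built by halving again."""
    if "0" not in bits:
        return 1          # purely 1-bits
    if "1" not in bits:
        return 0          # purely 0-bits
    mid = len(bits) // 2
    return (0, p1_tree(bits[:mid]), p1_tree(bits[mid:]))

tree = p1_tree("11110111")
```

For the bit file 11110111, the pure first half collapses to a single leaf while the mixed second half recurses.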
PEANO TREES (Ptrees)
Ptrees are run-compressed, lossless representations of the data. Ptrees can be 1-dimensional (recursively halving the bit file), 2-dimensional (recursively quartering, e.g., for images), 3-dimensional, ...
The most useful form of a Ptree is the predicate-Ptree (a 1-bit at a node iff the corresponding half (or quadrant, or ...) satisfies a predicate), e.g., the Pure1 Ptree, which has a 1-bit at a node iff the corresponding half is purely 1s (previous slide), and the NonPure0 Ptree, which has a 1-bit iff the half is not purely 0s.
A 2-D P1tree

Tree levels (root to leaves):
Level 0: 0
Level 1: 1 0 0 0
Level 2: 0 0 1 0   1 1 0 1
Level 3: 1 1 1 0   0 0 1 0   1 1 0 1

The corresponding 8×8 bit file:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
1 1 1 1 0 0 0 0
0 1 1 1 0 0 0 0
2-D Pure1 tree node: 1 iff that sub-quadrant is purely 1-bits
One of the bit files from a raster-ordered spatial dataset (e.g., an image):
11111100 11111000 11111100 11111110 11110000 11110000 11110000 01110000
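The 2-D quartering can be sketched the same way (illustrative Python; the quadrant order NW, NE, SW, SE and the tuple layout are my assumptions):

```python
def p1_quad(grid):
    """2-D Pure1 tree sketch: a node is 1 iff its quadrant is all 1s,
    0 iff all 0s; a mixed quadrant becomes a 0-node with four children
    (NW, NE, SW, SE) built by quartering."""
    flat = [b for row in grid for b in row]
    if 0 not in flat:
        return 1
    if 1 not in flat:
        return 0
    h = len(grid) // 2
    quads = [[row[:h] for row in grid[:h]], [row[h:] for row in grid[:h]],
             [row[:h] for row in grid[h:]], [row[h:] for row in grid[h:]]]
    return (0,) + tuple(p1_quad(q) for q in quads)

# the 8x8 bit file from this slide, in row-major layout
bitfile = ("11111100" "11111000" "11111100" "11111110"
           "11110000" "11110000" "11110000" "01110000")
grid = [[int(b) for b in bitfile[r * 8:(r + 1) * 8]] for r in range(8)]
tree = p1_quad(grid)
```

The level-1 nodes come out 1 0 0 0 (pure-1 NW quadrant, two mixed quadrants, pure-0 SE quadrant), matching the tree on the previous slide.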
A Count Ptree: computing counts is usually the ultimate goal in data mining (we use P1trees instead of count trees because they are more compressed and can produce the needed counts quite quickly).
Terms: Peano or Z-ordering; pure (Pure-1/Pure-0) quadrant; root count; level; fan-out; QID (Quadrant ID), e.g., the pixel at (7, 1) = (111, 001) has QID 10.10.11 = 2.2.3.

8×8 bit file:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Count tree: root count 55; level-1 quadrant counts 16, 8, 15, 16; level-2 counts for the two mixed quadrants 3, 0, 4, 1 and 4, 4, 3, 4; leaf bits 1 1 1 0, 0 0 1 0, 1 1 0 1.
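To illustrate how counts fall out of a P1tree, here is my own sketch under the same assumed nested-tuple layout (`p1_quad` is a hypothetical helper, not NDSU code):

```python
def p1_quad(grid):
    """Build a 2-D Pure1 tree by recursive quartering (NW, NE, SW, SE)."""
    flat = [b for row in grid for b in row]
    if 0 not in flat:
        return 1
    if 1 not in flat:
        return 0
    h = len(grid) // 2
    quads = [[row[:h] for row in grid[:h]], [row[h:] for row in grid[:h]],
             [row[:h] for row in grid[h:]], [row[h:] for row in grid[h:]]]
    return (0,) + tuple(p1_quad(q) for q in quads)

def root_count(node, area):
    """Count of 1-bits under a node: a pure-1 node contributes its whole
    quadrant area, a pure-0 node contributes nothing, and a mixed node
    sums its four children's counts over quarter-sized quadrants."""
    if node == 1:
        return area
    if node == 0:
        return 0
    return sum(root_count(child, area // 4) for child in node[1:])

# the 8x8 bit file from the count example (root count 55)
bits = ("11111100" "11111000" "11111100" "11111110"
        "11111111" "11111111" "11111111" "01111111")
grid = [[int(b) for b in bits[r * 8:(r + 1) * 8]] for r in range(8)]
count = root_count(p1_quad(grid), 64)
```

No second pass over the raw bits is needed: pure quadrants contribute their whole area in one step.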
NP0tree
NP0tree: Node=1 iff that sub-quadrant is not purely 0s. NP0 and P1 are examples of <predicate>trees: node=1 iff sub-quadrant satisfies <predicate>
NP0tree for the same 8×8 bit file as the P1tree slide:
Level 0: 1
Level 1: 1 1 1 0
Level 2: 1 0 1 1   1 1 1 1
Level 3: 1 1 1 0   0 0 1 0   1 1 0 1
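A 1-D NP0 sketch under the same assumed tuple representation as the earlier P1 sketch (illustrative only, not the patented implementation):

```python
def np0_tree(bits):
    """NonPure0 tree sketch over a 1-D bit vector: a node is 1 iff its
    half is not purely 0s; only mixed halves carry children."""
    if "1" not in bits:
        return 0          # pure-0 leaf
    if "0" not in bits:
        return 1          # pure-1 leaf (still "not pure 0")
    mid = len(bits) // 2
    return (1, np0_tree(bits[:mid]), np0_tree(bits[mid:]))

tree = np0_tree("11110111")
```

Note the duality with the Pure1 sketch: mixed nodes carry a 1 here rather than a 0.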
Logical Operations on P-trees
Operations proceed level by level, and there are shortcuts: e.g., we only need to load the quadrant with QID 2 when ANDing NP0-tree1 and NP0-tree2.
The choice of 1-D, 2-D, … and the ordering, can be chosen to optimize compression and/or processing speed.
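The level-by-level AND with its pure-node shortcuts can be sketched as follows (my illustration; it assumes both trees were built over the same-length bit vector with the nested-tuple layout used in the earlier sketches):

```python
def ptree_and(a, b):
    """AND two Pure1 trees of the same shape, level by level.
    Shortcuts: a pure-0 operand forces 0 without descending, and a
    pure-1 operand yields the other operand unchanged."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    kids = tuple(ptree_and(x, y) for x, y in zip(a[1:], b[1:]))
    # collapse to a pure-0 leaf if every child came out pure 0
    return 0 if all(k == 0 for k in kids) else (0,) + kids

# P1 trees of the 1-D bit vectors 11110000 and 10110111
a = (0, 1, 0)
b = (0, (0, (0, 1, 0), 1), (0, (0, 0, 1), 1))
result = ptree_and(a, b)   # P1 tree of 11110000 AND 10110111 = 10110000
```

The shortcuts are what make the multi-operand AND fast: whole pure subtrees never need to be loaded or traversed.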
1-D Ptrees: Compression Aspect
P-Trees: Ordering Aspect
The compression relies on long sequences of 0s or 1s. For images, neighboring pixels are more likely to be similar under Peano ordering (a space-filling curve) than under raster ordering.
Other data? Peano ordering can be generalized: Peano-order sorting of attributes maximizes compression.
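Peano (Z) ordering itself is easy to sketch via Morton bit interleaving (illustrative Python; `z_index` is a made-up name):

```python
def z_index(x, y, bits=8):
    """Morton/Z-order index: interleave the bits of the two coordinates
    so that spatially close pixels get numerically close indexes."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

# re-order a 4x4 raster scan into Peano/Z order
raster = [(x, y) for y in range(4) for x in range(4)]
zorder = sorted(raster, key=lambda p: z_index(p[0], p[1]))
```

Each 2×2 block of pixels becomes contiguous in the ordering, which is what produces the long pure runs the Ptree compression exploits.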
1-D Peano-Order Sorting
Impact of Peano-Order Sorting
Impact of Sorting on Execution Speed
[Chart: execution time in seconds (0–120) on the adult, spam, mushroom, function, and crop datasets, comparing Unsorted, Simple Sorting, and Generalized Peano Sorting.]
[Chart: time per test sample in milliseconds (0–80) vs. number of training points (0–30,000).]
Speed improvement especially for large data sets
Less than O(N) scaling for all algorithms
Many-to-Many (M-M) Relationships
Tables are M-M (-M-M…-M) relationships of domain (entity) elements
Graphs are M-M self relationships between an entity and itself
Protein-Protein interactions Customer-customer interactions
“Everything should be made as simple as
possible, but not simpler”
Albert Einstein
Claim: representation as a single relation is not rich enough. Example: the contribution of a graph structure to standard mining problems:
Genomics: protein-protein interactions
WWW: link structure
Scientific publications: citations
Scientific American 05/03
Data on a Graph
Common topics: analyze edge structure (Google, biological networks), sub-graph matching (chemistry), visualization; the focus is on graph structure.
Our work: focus on mining node data; the graph structure provides connectivity.
Protein-Protein Interactions
Protein data from MIPS (Munich Information Center for Protein Sequences): hierarchical attributes (function, localization, pathways) and gene-related properties.
Interactions from experiments, represented as an undirected graph.
Questions:
Prediction of a property (KDD-cup 02: AHR*)
Which properties in neighbors are relevant?
How should we integrate neighbor knowledge?
What are interesting patterns?
Which properties say more about neighboring nodes than about the node itself?
*AHR: Aryl Hydrocarbon Receptor Signaling Pathway
Possible Representations
OR-based: at least one neighbor has the property. Example: neighbor essential = true.
AND-based: all neighbors have the property. Example: neighbor essential = false.
Path-based (depends on maximum hops): one record for each path. Classification: weighting? Association Rule Mining: the record base changes.
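The OR- and AND-based representations can be sketched as follows (toy Python; the graph and essentiality labels are invented for illustration, not MIPS data):

```python
def neighbor_features(adj, prop):
    """OR/AND-based neighbor representation: for each node, record
    whether at least one neighbor (OR) and whether every neighbor (AND)
    has the boolean property."""
    return {node: (any(prop[n] for n in nbrs), all(prop[n] for n in nbrs))
            for node, nbrs in adj.items()}

# hypothetical protein interaction graph and 'essential' labels
adj = {"G1": ["G2", "G3"], "G2": ["G1", "G3"], "G3": ["G1", "G2"]}
essential = {"G1": True, "G2": False, "G3": True}
feats = neighbor_features(adj, essential)
```

Each node thus gets two derived attributes per property, which standard (non-graph) miners can then consume.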
Association Rule Mining (OR-based representation)
Conditions: the association rule involves AHR; support across a link is greater than within a node; minimum confidence and support thresholds.
Top 3 with respect to support (results by Christopher Besemann, project CSci 366):
AHR → essential
AHR → nucleus (localization)
AHR → transcription (function)
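For reference, the support and confidence of such a rule over a set of node records can be sketched as (illustrative Python; the records below are made up, not the project's data):

```python
def rule_stats(records, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent over
    a list of item sets: support = P(both), confidence = P(cons | ante)."""
    n = len(records)
    ante = sum(1 for r in records if antecedent <= r)
    both = sum(1 for r in records if antecedent <= r and consequent <= r)
    return both / n, (both / ante if ante else 0.0)

# made-up node records for illustration only
records = [{"AHR", "essential"}, {"AHR", "essential"},
           {"AHR"}, {"essential"}]
support, confidence = rule_stats(records, {"AHR"}, {"essential"})
```

Ranking candidate rules by these two measures is how a "top 3 with respect to support" list is produced.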
Classification Results
Problem (especially for the path-based representation): varying amounts of information per record make many algorithms unsuitable in principle, e.g., algorithms that divide the domain space.
KDD-cup 02: a very simple additive model, based on visually identifying relationships; the number of interacting essential genes adds to the probability of predicting a protein as AHR.
KDD-Cup 02: Honorable Mention
NDSU Team