Vertical Data In Data Processing, you run up against two curses immediately. Curse of cardinality:...


Transcript of Vertical Data In Data Processing, you run up against two curses immediately. Curse of cardinality:...

Page 1: Vertical Data In Data Processing, you run up against two curses immediately. Curse of cardinality: solutions don’t scale well with respect to record volume.

Vertical Data

In Data Processing, you run up against two curses immediately.

Curse of cardinality: solutions don’t scale well with respect to record volume. "files are too deep!"

Curse of dimensionality: solutions don’t scale with respect to attribute dimension. "files are too wide!"

The curse of cardinality is a problem in both the horizontal and vertical data worlds!

In the horizontal data world it was disguised as the “curse of slow joins”. In the horizontal world we decompose relations to get good design (e.g., 3rd normal form), but then we pay for that by requiring many slow joins to get the answers we need.

Page 2:

Techniques to address these curses:

Horizontal Processing of Vertical Data (HPVD), instead of the ubiquitous Vertical Processing of Horizontal (record-oriented) Data (VPHD).

Parallelizing the processing engine:

Parallelize the software engine on clusters of computers.

Parallelize the greyware engine on clusters of people (i.e., enable visualization and use the web...).

Again, we need better techniques for data analysis, querying and mining because of:

Parkinson’s Law: Data volume expands to fill available data storage.

Moore’s law: Available storage doubles every 9 months!

Page 3:

A few HPVD successes:

1. Precision Agriculture

Yield prediction: the Remotely Sensed Imagery (RSI) dataset consists of an aerial photograph (RGB TIFF image taken ~July) and a synchronized crop yield map taken at harvest; thus, 4 feature attributes (B, G, R, Y) and ~100,000 pixels.

[Figure: TIFF image and yield map]

Producers are able to analyze the color intensity patterns from aerial and satellite photos taken in mid-season to predict yield (find associations between electromagnetic reflection and yield). E.g., “hi_green & low_red → hi_yield”. That is very intuitive.

A stronger association, “hi_NIR & low_red → hi_yield” (found through HPVD data mining), allows producers to take and query mid-season aerial photographs for low_NIR & high_red grid cells and, where low yield is anticipated, apply (top dress) additional nitrogen. Can producers use Landsat images of China to predict wheat prices before planting?

2. Infestation Detection (e.g., Grasshopper Infestation Prediction, again involving RSI)

Grasshoppers cause significant economic loss each year.

Early infestation prediction is key to damage control.

Pixel classification on remotely sensed imagery holds much promise for early detection. Pixel classification (signaturing) has many, many applications: pest detection, flood monitoring, fire detection, wetlands monitoring …
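An association such as “hi_green & low_red → hi_yield” is usually judged by its support and confidence, the standard association-rule measures (the slides do not name them). The following is only a minimal sketch: the pixel records are made up for illustration, and the helper `rule_stats` is a hypothetical name, not part of any HPVD tool.

```python
# Hypothetical, made-up pixel records: (green, red, yield), each already
# discretized to "hi" or "low". Real RSI data would need thresholding first.
pixels = [
    ("hi", "low", "hi"),
    ("hi", "low", "hi"),
    ("hi", "low", "low"),
    ("hi", "hi", "low"),
    ("low", "low", "low"),
    ("low", "hi", "low"),
    ("hi", "low", "hi"),
    ("low", "hi", "hi"),
]


def rule_stats(records, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    antecedent and consequent are predicates over a single record.
    """
    n = len(records)
    n_ante = sum(1 for r in records if antecedent(r))
    n_both = sum(1 for r in records if antecedent(r) and consequent(r))
    support = n_both / n if n else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Rule under test: hi_green & low_red -> hi_yield
support, confidence = rule_stats(
    pixels,
    antecedent=lambda r: r[0] == "hi" and r[1] == "low",
    consequent=lambda r: r[2] == "hi",
)
print(f"support={support:.3f} confidence={confidence:.3f}")
```

Support is the fraction of all pixels satisfying both sides of the rule; confidence is the fraction of antecedent pixels that also satisfy the consequent.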

Page 4:

3. Sensor Network Data HPVD

Micro- and nano-scale sensor blocks are being developed for sensing: biological agents, chemical agents, motion detection, coatings deterioration, RF-tagging of inventory (RFID tags for Supply Chain Mgmt), and structural materials fatigue.

There will be trillions++ of individual sensors creating mountains of data which can be data mined using HPVD (maybe it shouldn't be called a success yet?).

Page 5:

4. A Sensor Network Application:

Each energized nano-sensor transmits a ping (location is triangulated from the ping). These locations are then translated to 3-dimensional coordinates at the display. The corresponding voxel on the display lights up. This is the expendable, one-time, cheap sensor version.

A more sophisticated CEASR device could sense and transmit the intensity levels, lighting up the display voxel with the same intensity.

Wherever a threshold level is sensed (chem, bio, thermal...), a ping is registered in a compressed structure (P-tree – detailed definition coming up) for that location.

[Figure: nano-sensors dropped into the situation space; the soldier sees a replica of the sensed situation prior to entering the space.]

Using Alien Technology’s Fluidic Self-assembly (FSA) technology, clear plexiglass laminates are joined into a cube, with an embedded nano-LED at each voxel.


CubE for Active Situation Replication (CEASR)

The single compressed structure (P-tree) containing all the information is transmitted to the cube, where the pattern is reconstructed (uncompress, display).

Page 6:

5. Anthropology Application

Digital Archive Network for Anthropology (DANA): analyze, query and mine anthropological artifacts (shape, color, discovery location, …)

Page 7:

What has spawned these successes? (i.e., What is Data Mining?)

Querying is asking specific questions for specific answers.

Data Mining is finding the patterns that exist in data (going into MOUNTAINS of raw data for the information gems hidden in that mountain of data).

Raw data must be cleaned of: missing items, outliers, noise, errors.

Data Warehouse: cleaned, integrated, read-only, periodic, historical database

[Figure: data mining process diagram. Labels: selection (feature extraction, tuple selection); task-relevant data; data mining (classification, clustering, rule mining); pattern evaluation and assay; visualization; loopbacks; smart files.]

Page 8:

Data Mining versus Querying

There is a whole spectrum of techniques to get information from data:

Standard querying: SQL (SELECT FROM WHERE); complex queries (nested, EXISTS..)

Searching and aggregating: fuzzy queries, search engines, BLAST searches; OLAP (rollup, drilldown, slice/dice..)

Machine learning: supervised learning (classification, regression); unsupervised learning (clustering)

Data mining: association rule mining, data prospecting, fractals, …

Even on the query end, much work is yet to be done (D. DeWitt, ACM SIGMOD Record ’02).

On the data mining end, the surface has barely been scratched. But even those scratches have had a great impact. For example, one of the early scratchers became the biggest corporation in the world, while a non-scratcher had to file for bankruptcy protection (Walmart vs. KMart).

HPVD Approach: vertical, compressed data structures, Predicate-trees or Peano-trees (Ptrees in either case)¹, processed horizontally. (Most DBMSs process horizontal data vertically.)

Ptrees are data-mining-ready, compressed data structures, which attempt to address the curses of cardinality and dimensionality.

¹ Ptree Technology is patented by North Dakota State University.

Page 9:

Predicate trees (Ptrees): vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic 1D Ptree.

Given a table structured into horizontal records (which are traditionally processed vertically - VPHD):

R(A1 A2 A3 A4)

Base 10          Base 2
2 7 6 1     =    010 111 110 001
3 7 6 0     =    011 111 110 000
2 6 5 1     =    010 110 101 001
2 7 5 7     =    010 111 101 111
3 2 1 4     =    011 010 001 100
2 2 1 5     =    010 010 001 101
7 0 1 4     =    111 000 001 100
7 0 1 4     =    111 000 001 100

Vertical bit slices, one per bit position of R[A1], R[A2], R[A3], R[A4]:

R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43
 0   1   0   1   1   1   1   1   0   0   0   1
 0   1   1   1   1   1   1   1   0   0   0   0
 0   1   0   1   1   0   1   0   1   0   0   1
 0   1   0   1   1   1   1   0   1   1   1   1
 0   1   1   0   1   0   0   0   1   1   0   0
 0   1   0   0   1   0   0   0   1   1   0   1
 1   1   1   0   0   0   0   0   1   1   0   0
 1   1   1   0   0   0   0   0   1   1   0   0

E.g., compression of R11 into P11 goes as follows.

Top-down construction of the 1-dimensional Ptree of R11, denoted P11: record the truth of the universal predicate pure1 in a tree, recursively on halves (1/2^1 subsets), until purity is achieved.

R11 = 0 0 0 0 0 0 1 1

1. Whole is pure1? false = 0

2. Left half pure1? false = 0. But it is pure (pure0), so this branch ends.

3. Right half pure1? false = 0

4. Left half of right half pure1? false = 0

5. Right half of right half pure1? true = 1

P11:
        0
      /   \
     0     0
          / \
         0   1

1-Dimensional Ptrees

[Figure: the basic 1-dimensional Ptrees P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43, one per bit slice, joined by AND (^).]

VPHD to find the number of occurrences of 7 0 1 4: scan the horizontally structured records vertically; the count is 2.

HPVD to find the number of occurrences of 7 0 1 4? AND these basic Ptrees (next slide).
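The vertical projection and the top-down construction described on this slide can be sketched in a few lines of Python. This is an illustrative sketch, not the patented implementation: the function names `bit_slice` and `build_top_down` and the tuple representation of a tree are assumptions made here, and the base-10 rows are read off the base-2 column of the slide.

```python
# Table R in base 10, as read from the base-2 column on the slide.
R = [
    (2, 7, 6, 1),
    (3, 7, 6, 0),
    (2, 6, 5, 1),
    (2, 7, 5, 7),
    (3, 2, 1, 4),
    (2, 2, 1, 5),
    (7, 0, 1, 4),
    (7, 0, 1, 4),
]
BITS = 3  # every attribute value fits in 3 bits


def bit_slice(rows, attr, bit):
    """Vertical projection R[attr][bit]; attr is 1-based, bit 1 is the high-order bit."""
    shift = BITS - bit
    return [(row[attr - 1] >> shift) & 1 for row in rows]


def build_top_down(bits):
    """Build a basic Ptree by recursive halving.

    Assumed representation: 1 = pure1 node, 0 = pure0 node,
    (left, right) = mixed node (the slide writes a 0 at such a node,
    because the pure1 predicate is false there).
    Assumes len(bits) is a power of two.
    """
    if all(b == 1 for b in bits):
        return 1
    if all(b == 0 for b in bits):
        return 0  # pure0: this branch ends
    half = len(bits) // 2
    return (build_top_down(bits[:half]), build_top_down(bits[half:]))


R11 = bit_slice(R, 1, 1)
P11 = build_top_down(R11)
print(R11)  # the high-order bit of A1 for every record
print(P11)  # nested tuples mirroring the tree drawn on the slide
```

For the slide's R11 the result is a mixed root whose left half is pure0 and whose right half splits into a pure0 quarter and a pure1 quarter, matching steps 1 to 5.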

Page 10:

R(A1 A2 A3 A4)

Base 10          Base 2
2 7 6 1     =    010 111 110 001
3 7 6 0     =    011 111 110 000
2 6 5 1     =    010 110 101 001
2 7 5 7     =    010 111 101 111
5 2 1 4     =    101 010 001 100
2 2 1 5     =    010 010 001 101
7 0 1 4     =    111 000 001 100
7 0 1 4     =    111 000 001 100

Vertical bit slices of R[A1], R[A2], R[A3], R[A4]:

R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43
 0   1   0   1   1   1   1   1   0   0   0   1
 0   1   1   1   1   1   1   1   0   0   0   0
 0   1   0   1   1   0   1   0   1   0   0   1
 0   1   0   1   1   1   1   0   1   1   1   1
 1   0   1   0   1   0   0   0   1   1   0   0
 0   1   0   0   1   0   0   0   1   1   0   1
 1   1   1   0   0   0   0   0   1   1   0   0
 1   1   1   0   0   0   0   0   1   1   0   0

[Figure: the basic Ptrees P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43, joined by AND (^).]

To count occurrences of 7,0,1,4 use 111 000 001 100, complementing (’) the Ptree of every bit position where the pattern has a 0:

P11 ^ P12 ^ P13 ^ P’21 ^ P’22 ^ P’23 ^ P’31 ^ P’32 ^ P33 ^ P41 ^ P’42 ^ P’43 =

        0
      /   \
     0     0
          / \
         0   1

Figure annotations: "This 0 makes entire left branch 0"; "These 0s make this node 0"; "These 1s and these 0s (which when complemented are 1's) make this node 1".

The 2^1-level has the only 1-bit, so 1-count = 1 * 2^1 = 2.
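The AND-and-count step can be sketched as follows. This is an illustration under assumptions, not the authors' code: a tree is represented here as 1 (pure1), 0 (pure0) or a `(left, right)` tuple for a mixed node, and the helper names `build`, `complement`, `ptree_and`, `one_count` and `count_pattern` are all hypothetical. The rows are the base-2 records listed on this slide.

```python
from functools import reduce

# Records of R in base 2 as listed on this slide (12 bits per record).
ROWS = [
    "010111110001",
    "011111110000",
    "010110101001",
    "010111101111",
    "101010001100",
    "010010001101",
    "111000001100",
    "111000001100",
]


def build(bits):
    """Basic Ptree: 1 = pure1, 0 = pure0, (left, right) = mixed node."""
    if all(bits):
        return 1
    if not any(bits):
        return 0
    half = len(bits) // 2
    return (build(bits[:half]), build(bits[half:]))


def complement(tree):
    """P' : swap pure0 and pure1 leaves, keep the tree shape."""
    if isinstance(tree, tuple):
        return (complement(tree[0]), complement(tree[1]))
    return 1 - tree


def ptree_and(a, b):
    """AND two Ptrees without decompressing them."""
    if a == 0 or b == 0:
        return 0  # a pure0 operand makes the whole branch 0
    if a == 1:
        return b
    if b == 1:
        return a
    left = ptree_and(a[0], b[0])
    right = ptree_and(a[1], b[1])
    if left == right and not isinstance(left, tuple):
        return left  # collapse pure siblings
    return (left, right)


def one_count(tree, size):
    """Number of 1-bits represented by a tree covering `size` records."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    return one_count(tree[0], size // 2) + one_count(tree[1], size // 2)


def count_pattern(rows, pattern):
    """AND P_j (pattern bit 1) or P'_j (pattern bit 0) over all bit positions."""
    trees = []
    for j, want in enumerate(pattern):
        column = [int(row[j]) for row in rows]  # vertical bit slice j
        tree = build(column)
        trees.append(tree if want == "1" else complement(tree))
    result = reduce(ptree_and, trees)
    return result, one_count(result, len(rows))


result, count = count_pattern(ROWS, "111000001100")
print(result, count)
```

The early return on a pure0 operand is what lets whole branches be skipped: no record in that half can match, so nothing below it needs to be examined.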


Page 11:

Top-down construction of basic P-trees is best for understanding; bottom-up is much faster (once across).

Bottom-up construction of the 1-dimensional Ptree P11 is done using in-order tree traversal, collapsing pure siblings as we go:

R11 = 0 0 0 0 1 0 1 1

Bits 1-2 (0 0): siblings are pure0, so collapse into 0.

Bits 3-4 (0 0): siblings are pure0, so collapse into 0.

The two resulting 0 siblings are pure0, so collapse again: the left half is 0.

Bits 5-6 (1 0): mixed, so the parent is 0 and keeps both children.

Bits 7-8 (1 1): siblings are pure1, so collapse into 1.

The right half has children 0 (mixed) and 1, so it is 0; the root is 0.

P11:
          0
        /   \
       0     0
           /   \
          0     1
         / \
        1   0
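One way to realize the single-pass, bottom-up construction is with a stack that merges a node with its left sibling as soon as both are complete. This is only a sketch of the idea: the stack-based formulation, the name `build_bottom_up`, and the tuple representation (1 = pure1, 0 = pure0, `(left, right)` = mixed) are assumptions, not details given on the slide.

```python
def build_bottom_up(bits):
    """Build a basic Ptree in one left-to-right pass over the bit slice.

    Stack entries are (level, node); level 0 is a single bit.
    Raises ValueError if len(bits) is not a power of two.
    """
    stack = []
    for b in bits:
        level, node = 0, b
        # If the top of the stack is the left sibling at the same level, merge upward.
        while stack and stack[-1][0] == level:
            _, left = stack.pop()
            if left == node and not isinstance(node, tuple):
                node = left          # pure siblings collapse into one pure node
            else:
                node = (left, node)  # mixed node keeps both children
            level += 1
        stack.append((level, node))
    if len(stack) != 1:
        raise ValueError("bit slice length must be a power of two")
    return stack[0][1]


P11 = build_bottom_up([0, 0, 0, 0, 1, 0, 1, 1])
print(P11)
```

Each bit is touched once and each merge is constant work, which is the sense in which the bottom-up pass goes "once across" the data.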

Page 12:

Thank you.