Why the Information Explosion Can Be Bad for Data Mining, and How Data Fusion Provides a Way Out

Written By: Putten, Kok, Gupta

Presented By: Ernesto Ochandio, DSCI 5240, November Dec 7, 2005




Problem Definition

• Exponential growth in data capture leads to data fragmentation.

– POS customer tracking

– Corporate Data Warehouse

– Advanced Analytics

• Increased popularity of personalized messages.

• Prohibitive attitudinal data costs.


Data Fusion Overview

• Data Fusion is the combination of information from different sources.

• Also known as: Micro Data Set Merging, Statistical Record Linkage, and Multi-Source Imputation

• Example:

– Demographic and psychographic data aggregated at geographical level.

– Same characteristics for people in the same region.

• Motivation:

– Algorithms can create generalized fusions providing richer data sets for use in applications or future data mining projects.
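The geographic example on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `survey` and `customers` records, the `eco_score` attribute, and the `regional_profile` helper are hypothetical names invented here and do not come from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical survey respondents: the source of the psychographic attribute.
survey = [
    {"region": "north", "eco_score": 4},
    {"region": "north", "eco_score": 2},
    {"region": "south", "eco_score": 5},
]

# Hypothetical customers that lack the psychographic attribute.
customers = [
    {"id": "a", "region": "north"},
    {"id": "b", "region": "south"},
    {"id": "c", "region": "north"},
]


def regional_profile(rows, key, value):
    """Aggregate `value` at the level of `key` (here: average per region)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: mean(v) for k, v in groups.items()}


profile = regional_profile(survey, "region", "eco_score")

# Every customer in a region receives that region's aggregated value.
enriched = [{**c, "eco_score": profile.get(c["region"])} for c in customers]

for row in enriched:
    print(row)
```

Customers "a" and "c" end up with identical values because they share a region, which is the limitation that the second sub-bullet points out.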


Data Fusion Terminology

• Recipient, Donor, Fused Variables, Common Variables, Critical Common Variables

Figure: Recipient + Donor = Fused Dataset, with the columns labeled Common Variables and Fused Variables.
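As a rough illustration of how these terms relate, the sketch below derives the variable groups from two small tables. It assumes the usual reading of the terms: common variables occur in both data sets, and fused variables occur only in the Donor. The records, the column names, and the choice of `gender` as a critical common variable are all hypothetical.

```python
# Hypothetical Recipient: the data set to be enriched.
recipient = [
    {"age": 34, "region": "north", "gender": "f"},
    {"age": 51, "region": "south", "gender": "m"},
]

# Hypothetical Donor: shares some variables and carries an extra one.
donor = [
    {"age": 36, "region": "north", "gender": "f", "reads_news": 1},
    {"age": 49, "region": "south", "gender": "m", "reads_news": 0},
]

recipient_vars = set(recipient[0])
donor_vars = set(donor[0])

# Common variables occur in both data sets and drive the matching.
common_variables = recipient_vars & donor_vars

# Fused variables occur only in the Donor and are carried to the Recipient.
fused_variables = donor_vars - recipient_vars

# Critical common variables are a modeling choice (assumed here): they must
# match exactly, so they have to be a subset of the common variables.
critical_common_variables = {"gender"}
assert critical_common_variables <= common_variables

print(sorted(common_variables))
print(sorted(fused_variables))
```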


Data Fusion Algorithm

• Find best Donor elements that match the Recipient element.

• Ensure Critical Variable exact match.

• Limit Donor element usage.

• Use averages from the Donor set to estimate the Fused variables for the Recipient set.

Figure: Recipient + Donor = Fused Dataset

– Recipient: elements C1 to C18, each with common variables X, Y, Z.

– Donor: five elements with common variables X, Y, Z and fused values 10 10 10 10, 20 20 20 20, 10 10 10 10, 20 20 20 20, 30 30 30 30.

– Fused Dataset: C1 to C18 with X, Y, Z; C1 shows fused values 15 15 15 15 and C10 shows 20 20 20 20; the other rows are masked (x) in the slide.
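The four steps on this slide can be sketched as a small nearest-neighbor procedure. This is a minimal sketch, not the authors' implementation: the squared-distance measure, the parameters `k` and `max_uses`, the function name `fuse`, and all of the data are assumptions made for illustration. The slide does not say how "best" is measured or how the usage limit is enforced.

```python
from statistics import mean


def fuse(recipients, donors, common, critical, fused, k=2, max_uses=2):
    """Estimate fused variables for each recipient from its best donors."""
    uses = [0] * len(donors)  # how often each donor has been used so far
    result = []
    for rec in recipients:
        # Exact match on critical variables, and usage limit not yet reached.
        eligible = [
            i for i, d in enumerate(donors)
            if uses[i] < max_uses and all(d[c] == rec[c] for c in critical)
        ]
        # Assumed distance: squared difference over numeric common variables.
        eligible.sort(
            key=lambda i: sum((donors[i][v] - rec[v]) ** 2 for v in common)
        )
        chosen = eligible[:k]
        for i in chosen:
            uses[i] += 1
        row = dict(rec)
        for f in fused:
            # Average the chosen donors; None if no donor was eligible.
            row[f] = mean(donors[i][f] for i in chosen) if chosen else None
        result.append(row)
    return result


# Hypothetical data: "sex" is critical, "age" and "income" are common,
# and "score" is the fused variable that only the donors carry.
donors = [
    {"sex": "f", "age": 30, "income": 40, "score": 10},
    {"sex": "f", "age": 34, "income": 44, "score": 20},
    {"sex": "m", "age": 31, "income": 41, "score": 10},
    {"sex": "m", "age": 50, "income": 60, "score": 20},
    {"sex": "f", "age": 52, "income": 65, "score": 30},
]
recipients = [
    {"id": "C1", "sex": "f", "age": 32, "income": 42},
    {"id": "C2", "sex": "f", "age": 33, "income": 43},
    {"id": "C3", "sex": "f", "age": 31, "income": 41},
]

fused_rows = fuse(recipients, donors, ["age", "income"], ["sex"], ["score"])
for row in fused_rows:
    print(row["id"], row["score"])
```

With this data, C1 and C2 each average the two nearest female donors (scores 10 and 20). Both of those donors then reach the usage limit, so C3 falls back to the only remaining eligible donor. That shows the trade-off in the third bullet: limiting Donor usage avoids over-representation but can force a weaker match.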


Conclusion

• Data Fusion increases the value of Data Mining: it creates richer data to mine while reducing costs, and it finds the best possible matches without over-representing elements in the Donor set.