Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT...

19
Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay

Transcript of Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT...

Page 1: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Data warehousing and mining

Session VII (Part 1) 15:45 - 16:10

Sunita Sarawagi

School of IT, IIT Bombay

Page 2: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 2

Introduction• Organizations getting larger and amassing ever

increasing amounts of data• Historic data encodes useful information about working

of an organization.• However, data scattered across multiple sources, in

multiple formats.• Data warehousing: process of consolidating data in a

centralized location• Data mining: process of analyzing data to find useful

patterns and relationships

Page 3: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 3

Typical data analysis tasks

• Report the per-capita deposits broken down by region and profession.

• Are deposits from rural coastal areas increasing over last five years?

• What percent of small business loans were cleared?• Why is it less than last year’s? How did similar

businesses that did not take loans perform?• What should be the new rules for loan eligibility?

Page 4: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 4

Bombay branch Delhi branch Calcutta branchCensusdata

Operational data

Detailed transactionaldata

Data warehouseMergeCleanSummarize

DirectQuery

Reportingtools

MiningtoolsOLAP

Decision support tools

Oracle SAS

RelationalDBMS+e.g. Redbrick

IMS

Crystal reports Essbase Intelligent Miner

GISdata

Page 5: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 5

Data warehouse construction• Heterogeneous data integration

– merge from various sources, fuzzy matches

– remove inconsistencies

• Data cleaning: – missing data, outliers, clean fields e.g. names/addresses

– Data mining techniques

• Data loading: summarize, create indices

• Products: Prism warehouse manager, Platinum info refiner, info pump, QDB, Vality

Page 6: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 6

Warehouse maintenance

• Data refresh– when to refresh, what form to send updates?

• Materialized view maintenance with batch updates.

• Query evaluation using materialized views

• Monitoring and reporting tools– HP intelligent warehouse advisor

Page 7: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 7

Bombay branch Delhi branch Calcutta branchCensusdata

Operational data

Detailed transactionaldata

Data warehouseMergeCleanSummarize

DirectQuery

Reportingtools

MiningtoolsOLAP

Decision support tools

Oracle SAS

RelationalDBMS+e.g. Redbrick

IMS

Crystal reports Essbase Intelligent Miner

GISdata

Page 8: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 8

OLAPFast, interactive answers to large aggregate queries.• Multidimensional model: dimensions with hierarchies

– Dim 1: Bank location: • branch-->city-->state

– Dim 2: Customer:• sub profession --> profession

– Dim 3: Time:• month --> quarter --> year

• Measures: loan amount, #transactions, balance

Page 9: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 9

OLAP

• Navigational operators: Pivot, drill-down, roll-up, select.

• Hypothesis driven search: E.g. factors affecting defaulters– view defaulting rate on age aggregated over other

dimensions

– for particular age segment detail along profession

• Need interactive response to aggregate queries..

Page 10: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 10

OLAP products

• About 30 OLAP vendors

• Dominant ones:– Oracle Express: largest market share: 20%– Arbor Essbase: technology leader– Microsoft Plato: introduced late last year,

rapidly taking over...

Page 11: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 11

Microsoft OLAP strategy• Plato: OLAP server: powerful, integrating various

operational sources • OLE-DB for OLAP: emerging industry standard

based on MDX --> extension of SQL for OLAP• Pivot-table services: integrate with Office 2000

– Every desktop will have OLAP capability.

• Client side caching and calculations• Partitioned and virtual cube• Hybrid relational and multidimensional storage

Page 12: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 12

Data mining

• Process of semi-automatically analyzing large databases to find interesting and useful patterns

• Overlaps with machine learning, statistics, artificial intelligence and databases but– more scalable in number of features and instances– more automated to handle heterogeneous data

Page 13: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 13

Some basic operations

• Predictive:– Regression– Classification

• Descriptive:– Clustering / similarity matching– Association rules and variants– Deviation detection

Page 14: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 14

Classification

• Given old data about customers and payments, predict new applicant’s loan eligibility.

AgeSalaryProfessionLocationCustomer type

Previous customers Classifier Decision rules

Salary > 5 L

Prof. = Exec

New applicant’s data

Good/bad

Page 15: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 15

Classification methods

• Nearest neighbor

• Regression: (linear or any polynomial) – a*salary + b*age + c = eligibility score.

• Decision tree classifier

• Probabilistic/generative models

• Neural networks

Page 16: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 16

Clustering

• Unsupervised learning when old data with class labels not available e.g. when introducing a new product.

• Group/cluster existing customers based on time series of payment history such that similar customers in same cluster.

• Key requirement: Need a good measure of similarity between instances.

• Identify micro-markets and develop policies for each

Page 17: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 17

Association rules

• Given set T of groups of items• Example: set of item sets purchased • Goal: find all rules on itemsets of the

form a-->b such that– support of a and b > user threshold s

– conditional probability (confidence) of b given a > user threshold c

• Example: Milk --> bread• Purchase of product A --> service B

Milk, cerealTea, milk

Tea, rice, bread

cereal

T

Page 18: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 18

Mining market

• Around 20 to 30 mining tool vendors• Major players:

– Clementine,

– IBM’s Intelligent Miner,

– SGI’s MineSet,

– SAS’s Enterprise Miner.

• All pretty much the same set of tools• Many embedded products: fraud detection, electronic

commerce applications

Page 19: Data warehousing and mining Session VII (Part 1) 15:45 - 16:10 Sunita Sarawagi School of IT, IIT Bombay.

Dr. Sunita Sarawagi Data Warehousing & Mining 19

Conclusions

• The value of warehousing and mining in effective decision making based on concrete evidence from old data

• Challenges of heterogeneity and scale in warehouse construction and maintenance

• Grades of data analysis tools: straight querying, reporting tools, multidimensional analysis and mining.