Gio Wiederhold PDM 1 Profiting from Data Mining Gio Wiederhold November 2003.


Gio Wiederhold PDM 1

Profiting from Data Mining

Gio Wiederhold

November 2003

Gio Wiederhold PDM 2

Steps needed to profit

1. Obtaining relevant data
   – Always incomplete
2. Extracting relationships
   – Imputing causality
3. Finding applicability
   – Determining leverage points
4. Inventing candidate actions
   – Assessing likely outcomes and benefits
5. Selecting action to be taken
   – Measuring the outcome

Collecting data for next round

[Figure label: "? Model based"]

Gio Wiederhold PDM 3

Today's Problem: Disjointness

1. Database administrators
   • Focus on data collection, organization, currency
2. Analysts
   • Focus on slicing, dicing, relationships
3. Middle managers
   • Focus on their costs, profits
4. MBAs
   • Focus on business models, planning
5. Executives
   • Must make decisions based on diverse inputs

Gio Wiederhold PDM 4

1. Data Collection

Two choices

1. (rare) Collect data specifically for analysis
   • allows careful design -- model causes and effects
     Purchase = f(price, color, size, customer income, gender, ...)
   • costly
   • often small, to make collection manageable
   • imposes delays
2. (common) Use data collected for other purposes
   • take advantage of what is readily available
   • low cost: filtering, reformatting, integration
   • incomplete -- rarely covers all causes / effects
   • biased -- missing categories
     only people with phones, cars -- shopping in supermarkets

Gio Wiederhold PDM 5

1a. Data Integration

Needed when sources have inadequate coverage

• in distinct DBs for
  – Prices, number purchased
  – Customer segments (supermarket, stores, on-line)

implies some expectations

• append attributes where keys match: Joe
  – include semantic match: Joe = 012 34 567
• append rows where key types match: customer
  – include semantic match: customer = owner
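The two integration operations can be sketched in a few lines of Python. This is a minimal illustration, not a method from the slides: the sample rows, the field names (`customer_id`, `segment`, `store`), and the helper functions are all hypothetical. Only the two semantic matches (Joe = 012 34 567, customer = owner) come from the slide.

```python
# Hypothetical sample data; every value below is illustrative only.
purchases = [
    {"customer": "Joe", "item": "rice", "price": 2.50, "number": 3},
    {"customer": "Ann", "item": "tea", "price": 4.00, "number": 1},
]
segments = [
    {"customer_id": "012 34 567", "segment": "supermarket"},
    {"customer_id": "098 76 543", "segment": "on-line"},
]

# Semantic match between key values: the name and the identifier
# denote the same person.
same_entity = {"Joe": "012 34 567"}

# Semantic match between key types: one source says "owner",
# the other says "customer".
same_type = {"owner": "customer"}


def append_attributes(left, right, key_map):
    """Join: extend each left row with the segment of its matching right row."""
    by_id = {row["customer_id"]: row for row in right}
    joined = []
    for row in left:
        match = by_id.get(key_map.get(row["customer"]))
        segment = match["segment"] if match else None  # no match: leave a gap
        joined.append({**row, "segment": segment})
    return joined


def append_rows(rows_a, rows_b, rename):
    """Union: rename the columns of rows_b to the shared names, then append."""
    renamed = [{rename.get(k, k): v for k, v in row.items()} for row in rows_b]
    return rows_a + renamed


store_rows = [{"customer": "Joe", "store": "supermarket"}]
online_rows = [{"owner": "Ann", "store": "on-line"}]

joined = append_attributes(purchases, segments, same_entity)
combined = append_rows(store_rows, online_rows, same_type)
```

Rows without a semantic match keep a `None` attribute rather than being dropped, which keeps the incompleteness of the integrated data visible.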

Gio Wiederhold PDM 6

2. Data analysis

• Find relationships
  – already known: ignore or adjust in next round
    » requires comparison with expert knowledge
    » now have quantification
  – unknown
    » uninteresting per expert
    » interesting per expert

Gio Wiederhold PDM 7

3. Establish causality

• Already known -- Prior Model
  – But is it complete, i.e., does it explain all effects?
• Analyze relationships
  – use expertise to decide direction
    » often obvious: "common world knowledge"
    » sometimes ambiguous: smoking, cancer, not-smoking
    » often the major true cause is not captured in the data
      example: purchase of Chinese vs other food
      food color 10%, food price 20%, buyer gender 2%, unknown 75%
      guess: ethnicity, income
  – invent surrogates: names, ZIP codes
  – use temporal information

Gio Wiederhold PDM 8

Establishing causality is risky

Mined: Volvos have fewer accidents

1. Is a Volvo a safe car?
2. What causes accidents? Drivers!
3. Who buys Volvos? Careful drivers!
4. Must determine
   • effect of safe drivers
   • percentage of safe drivers overall
   • percentage of safe drivers with Volvos
5. How much of the accident rate is now explained?

The unexplained difference can be attributed to the car.
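A small worked example may make the Volvo reasoning concrete. All numbers below are invented for illustration; the slides give none. The sketch assumes only two kinds of drivers (careful and other), each with a fixed accident rate, and asks how much of the observed Volvo advantage the driver mix alone would explain.

```python
# Hypothetical accident rates per 1000 driver-years; not from the slides.
RATE_CAREFUL = 20.0
RATE_OTHER = 60.0

SHARE_CAREFUL_OVERALL = 0.5   # assumed share of careful drivers overall
SHARE_CAREFUL_VOLVO = 0.8     # assumed share of careful drivers with Volvos
OBSERVED_VOLVO_RATE = 25.0    # assumed mined accident rate for Volvos


def rate_from_driver_mix(share_careful):
    """Accident rate expected from the driver mix alone, ignoring the car."""
    return share_careful * RATE_CAREFUL + (1.0 - share_careful) * RATE_OTHER


overall_rate = rate_from_driver_mix(SHARE_CAREFUL_OVERALL)
volvo_rate_from_drivers = rate_from_driver_mix(SHARE_CAREFUL_VOLVO)

observed_gap = overall_rate - OBSERVED_VOLVO_RATE
explained_by_drivers = overall_rate - volvo_rate_from_drivers
attributed_to_car = observed_gap - explained_by_drivers

print(f"observed gap:         {observed_gap:.1f}")
print(f"explained by drivers: {explained_by_drivers:.1f}")
print(f"attributed to car:    {attributed_to_car:.1f}")
```

With these made-up inputs most of the gap comes from who drives the car, and only the remainder is left to be attributed to the car itself.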

Gio Wiederhold PDM 9

Change cause, create effects

To use results of data mining

• have to understand direction of relationships

[Figure: a model linking causes to effects. Labels: controllable causes, external causes, hidden, captured by data, interesting beneficial effects, side effects.]

Gio Wiederhold PDM 10

4. Causes provide the leverage

Language of analyst / Language of modeling

• Many causes -- independent variables
  – A few may be controllable
  – Some may be controlled by our competition
  – Others are forces-of-nature
• Even more effects -- dependent variables
  – A few may be desired
  – Some may be disastrous
  – Many are poorly understood
• Intermediate effects
  – Provide a means for measuring effectiveness
  – Allow correction of actions taken

Gio Wiederhold PDM 11

5. Planning & Assessment

Analyze Alternatives

• Current Capabilities

• Future Expectations

Process tasks:

• List resources

• Enumerate alternatives

• Prune alternatives

• Compare alternatives

[Figure labels: now; predict the future]

Gio Wiederhold PDM 12

Prediction Requires Tools

E-mail this book, Alfred Knopf, 1997

Gio Wiederhold PDM 13

Simulations predict

1. Back-of-the-envelope
   • Common
   • Adequate if model is simple
   • Assumptions are easily forgotten after some time, not distinguished from data: "Why are we doing this?"
2. Spreadsheets
   • Most common computing tool
   • Specialist modeler can help
   • New, recent data can be pasted in
   • Awkward for the tree of future alternatives
3. Constructed to order
   • Costly, powerful technology
   • Specialist modelers required
   • Expressive simulation languages
   • Requires specialists to set up, run, and rerun with new data

Gio Wiederhold PDM 14

Simulation results: likelihoods

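[Figure: tree of alternatives branching from now along the time axis, showing next-period alternatives and subsequent periods. Uncertainty increases with time. Likelihood labels in the figure: 0.4, 0.6, 0.18, 0.15, 0.13, 0.25, 0.2, 0.17, 0.4, 0.3, 0.19, 0.1, 0.11, 0.12, 0.3.]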

Gio Wiederhold PDM 15

Simulation services

Wide variety, but common principle

Inputs, Model, Output (time, $, place, ...)

1. Spreadsheets
   – Identify independent, controllable, and resulting values
2. Execution specific to query: what-if assessment
   – may require HPC power for adequate response
3. Continuously executing: weather prediction
   – Search for best match (location, time)
4. Past simulation results collected for future use
   – Typically sparse -- the dimension of the futures is too large
   – Tables in a design handbook: materials
   – Perform inter- or extra-polations to match query parameters
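Answering a query from a sparse table of stored results can be sketched as below. This is an assumption-laden illustration: the table (a material strength by temperature) and its values are hypothetical, and linear interpolation is only one possible choice; the slides do not prescribe a method.

```python
# Hypothetical handbook table: temperature (deg C) -> strength (MPa).
STORED_RESULTS = {20: 400.0, 100: 380.0, 200: 340.0}


def lookup(table, x):
    """Return (estimate, extrapolated) for query parameter x.

    Inside the table range the two bracketing entries are interpolated
    linearly; outside it the nearest end segment is extended, and the
    flag tells the caller that the answer is an extrapolation.
    """
    points = sorted(table.items())
    if len(points) < 2:
        raise ValueError("need at least two stored results")
    # Walk the segments; stop at the first one whose right end covers x.
    # If none does, the loop ends holding the last segment.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x <= x1:
            break
    estimate = y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    extrapolated = x < points[0][0] or x > points[-1][0]
    return estimate, extrapolated


inside = lookup(STORED_RESULTS, 60)    # between two stored entries
outside = lookup(STORED_RESULTS, 250)  # beyond the last stored entry
```

Returning the extrapolation flag alongside the estimate is a design choice of this sketch: it lets the caller treat answers outside the stored range with more caution.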

Gio Wiederhold PDM 16

6. Specify Value of Effects

Still needed: Value of alternative outcomes

• Decision maker / owner input
  – Benefits and costs
  – Potential profit
  – Correct for risk, and adjust to present value

[Figure: tree of futures along the time axis (past, now, futures), with values attached to the outcomes: 1000, 2000, 5000, 1000, 0, -2000, -6000.]
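One common way to correct an outcome's value for risk and adjust it to present value is to weight it by its likelihood and discount it over the periods until it occurs. The sketch below assumes exactly that; the slides do not give a formula, and the likelihood, horizon, and discount rate used here are hypothetical.

```python
def present_value(value, likelihood, periods, discount_rate):
    """Risk-weighted, discounted value of one future outcome.

    value         -- value if the outcome occurs
    likelihood    -- probability that it occurs (assumed risk correction)
    periods       -- number of periods until it occurs
    discount_rate -- per-period discount rate, e.g. 0.10 for 10%
    """
    if not 0.0 <= likelihood <= 1.0:
        raise ValueError("likelihood must be between 0 and 1")
    return likelihood * value / (1.0 + discount_rate) ** periods


# An outcome valued at 5000, assumed to be 40% likely, two periods ahead,
# discounted at an assumed 10% per period.
pv = present_value(5000, likelihood=0.4, periods=2, discount_rate=0.10)
print(round(pv, 2))
```

A decision maker who is risk-averse might apply a stronger correction than plain probability weighting; that choice is part of the owner input the slide asks for.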

Gio Wiederhold PDM 17

Having it all together

• Relationships from analyses of past data

• Data representing the current state

• List of actionable alternatives

• Tree of subsequent alternatives

• Probabilities of those alternatives

• Values of the outcomes

• Ability to predict the likelihood of futures

[Figure: the tree of alternatives combining branch likelihoods (0.4, 0.6, ...) with outcome values (1000, 2000, 5000, 1000, 0, -2000, -6000).]

Gio Wiederhold PDM 18

Vision: Putting it all together

Combine results mined from past data, current observations, and predictions into the future.

[Figure: along the time axis, support specialists supply the decision maker.]

Gio Wiederhold PDM 19

Needed: Information Systems that also project seamlessly into the Futures

Support of decision-making requires dealing with the futures, as well as the past

• Databases deal well with the past

• Streaming sensors supply current status

• Spreadsheets, simulations deal with the likely futures

Future information systems should combine all these sources

[Figure: time axis spanning past, now, future]

Gio Wiederhold PDM 20

Connecting it all

Build super systems

• Coherent, consistent
• Expensive
• Unmaintainable
• Too many cooks:
  – Database folk
  – Data miners
  – Analysts
  – Planners
  – Simulation specialists
  – Decision makers

Develop interfaces

• Incremental
• Composable as needed
• Heterogeneous
• Interfaces required: Metadata
  – Database to miners: SQL
  – Mined results to analysts: XML?
  – Analysts to planners: ?
  – Planners to simulations? SimQL
  – Decision makers: New tools!

Gio Wiederhold PDM 21

Interfaces enable integration
New: SimQL to access simulations

[Figure: time axis spanning past, now, futures]

• Databases and schemas, accessed via SQL or XML
• Message systems and sensors: streaming data
• Simulations, accessed via SimQL and schema-compliant wrappers
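The idea of a schema-compliant wrapper can be illustrated with a small Python sketch. It is not SimQL and does not reproduce its syntax, which the slides do not show. The class names, the `schema` and `run` methods, the `uncertainty` field, and the toy weather model are all hypothetical, chosen only to show a simulation hidden behind a declared set of inputs and outputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Estimate:
    value: float
    uncertainty: float  # hypothetical measure reported with each result


@dataclass
class SimulationWrapper:
    """Exposes a simulation through a declared schema, hiding the model."""
    name: str
    inputs: Dict[str, type]
    outputs: Dict[str, type]
    model: Callable[..., Dict[str, Estimate]]

    def schema(self) -> dict:
        # What a client may see: names only, nothing about the model.
        return {
            "name": self.name,
            "inputs": sorted(self.inputs),
            "outputs": sorted(self.outputs),
        }

    def run(self, **params) -> Dict[str, Estimate]:
        if set(params) != set(self.inputs):
            raise ValueError(f"expected parameters {sorted(self.inputs)}")
        return self.model(**params)


def toy_weather(location: str, days_ahead: int) -> Dict[str, Estimate]:
    # Made-up model: constant forecast, uncertainty grows with the horizon.
    return {"temperature_c": Estimate(20.0, 0.1 * days_ahead)}


weather = SimulationWrapper(
    name="weather",
    inputs={"location": str, "days_ahead": int},
    outputs={"temperature_c": float},
    model=toy_weather,
)
forecast = weather.run(location="X", days_ahead=3)
```

The client sees only the schema and the results, much as a database client sees a schema and query results rather than the storage layout.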

Gio Wiederhold PDM 22

SimQL proof-of-concept implementation

[Figure: architecture diagram.
Components: Parser, Metadata Manager, Query Manager, Schema Manager, metadata, wrapped simulations.
Users: developer (development interaction), customer (production interaction).
Flows: schema commands, filing of access specs, use of access specs, query, initiation and results of simulations, help, error reports.]

Gio Wiederhold PDM 23

Demonstration of SimQL

[Figure: test applications with a simple GUI use a common language to reach wrapped sources:
• Business planning spreadsheets (wrapper)
• Weather on the Internet (wrapper)
• Engineering simulation (wrapper)
• Shipping location database
Additional label: requirements.]

Gio Wiederhold PDM 24

Information system use of simulation results

Simulation results are mapped to alternative courses of action.

The information system should support a model driving the computation and recomputation of likelihoods.

Likelihoods change as now moves forward and eliminates earlier alternatives.

[Figure: tree of alternatives along the time axis, each branch labeled with a likelihood (prob). Labels in the figure: 0.4, 0.6, 0.2, 0.5, 0.3, 0.5, 0.2, 0.1, 0.1, 0.1, 0.03, 0.07, 0.1, 0.5, 0.3, 0.2.]

Gio Wiederhold PDM 25

The likelihoods multiply out to the end-effects; then their values can be applied to earlier nodes.

[Figure: tree of alternatives along the time axis (past, now, future), showing next-period alternatives and subsequent periods. Branches carry likelihoods (prob: 0.4, 0.6, 0.1, 0.2, 0.5, 0.3, 0.07, 0.13, ...). End-effects carry values (1000, 2000, 5000, 1000, 0, -6000, -3000). Earlier nodes show derived values (1200, 66, 134, -1220, 1266, -1086; and 100, 600, 1100, 500, 200, 200, -420, 0, -820, -400).]
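Multiplying likelihoods out to the end-effects and carrying values back to earlier nodes can be sketched as follows. The tree shape, branch labels, and numbers are hypothetical, since the structure of the slide's figure cannot be recovered from the transcript. The sketch also assumes that values are combined as probability-weighted sums.

```python
# Hypothetical tree: an inner node has "branches" mapping a label to
# (likelihood, child); an end-effect has a "value".
TREE = {
    "branches": {
        "A": (0.4, {"branches": {
            "A1": (0.5, {"value": 1000}),
            "A2": (0.5, {"value": 2000}),
        }}),
        "B": (0.6, {"branches": {
            "B1": (0.3, {"value": 5000}),
            "B2": (0.7, {"value": -3000}),
        }}),
    }
}


def end_effects(node, likelihood=1.0, path=()):
    """Yield (path, likelihood, value); likelihoods multiply along a path."""
    if "value" in node:
        yield path, likelihood, node["value"]
        return
    for label, (prob, child) in node["branches"].items():
        yield from end_effects(child, likelihood * prob, path + (label,))


def node_value(node):
    """Value of a node: the probability-weighted sum of its children."""
    if "value" in node:
        return node["value"]
    return sum(p * node_value(child) for p, child in node["branches"].values())


leaves = list(end_effects(TREE))
value_now = node_value(TREE)
value_if_a = node_value(TREE["branches"]["A"][1])
value_if_b = node_value(TREE["branches"]["B"][1])
```

Both routes agree: summing likelihood times value over the end-effects gives the same number as rolling the values back node by node.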

Gio Wiederhold PDM 26

Recomputation is needed at the next time phase

Re-assess as time marches forward!

[Figure: "A Pruned Bush". The time axis spans past, now, future. The tree of alternatives is shown with the eliminated branches removed and the remaining node values open to recomputation (for example "1266 ?"). Inputs: databases, spreadsheets, other simulations, messages, sensors.]
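Recomputation when now moves forward can be sketched as conditioning on the alternative that actually occurred. The path labels and likelihoods below are hypothetical, and rescaling the surviving paths so that they again sum to 1 is an assumption of this sketch. In practice the likelihoods might instead be recomputed from fresh data and simulations.

```python
# Hypothetical likelihoods of complete paths, as seen from the old "now".
PATHS = {
    ("A", "A1"): 0.20,
    ("A", "A2"): 0.20,
    ("B", "B1"): 0.18,
    ("B", "B2"): 0.42,
}


def advance_now(path_likelihoods, realized):
    """Move now one period forward, given the alternative that occurred.

    Paths starting with another alternative are eliminated. The rest drop
    their first step and are rescaled to sum to 1 again.
    """
    kept = {
        path[1:]: p
        for path, p in path_likelihoods.items()
        if path and path[0] == realized
    }
    total = sum(kept.values())
    if total == 0:
        raise ValueError(f"no remaining path starts with {realized!r}")
    return {path: p / total for path, p in kept.items()}


after_b = advance_now(PATHS, "B")
```

A path that looked 18% likely from the earlier vantage point becomes 30% likely once alternative B is known to have occurred, so the values attached to earlier nodes need recomputing as well.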

Gio Wiederhold PDM 27

Even the present needs SimQL

Not all data are current:

• Is the delivery truck in X?
• Is the right stuff on the truck?
• Will the crew be at X?
• Will the forces be ready to accept delivery?

[Figure: time axis spanning past, now, future. The last recorded observations lie in the past; simple simulations extrapolate the data to the point-in-time for situational assessment.]
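A "simple simulation to extrapolate data" can be as small as dead reckoning from the last recorded observation. The sketch below is illustrative only: the observation fields, the constant-speed assumption, and every number are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    time_h: float       # when the observation was recorded, in hours
    position_km: float  # distance travelled along the route
    speed_kmh: float    # speed at the time of the observation


def extrapolate_position(obs, now_h):
    """Estimate the position now, assuming the speed has not changed."""
    if now_h < obs.time_h:
        raise ValueError("now must not precede the last observation")
    return obs.position_km + obs.speed_kmh * (now_h - obs.time_h)


def probably_at(obs, now_h, target_km, tolerance_km):
    """Is the truck likely within tolerance of the target right now?"""
    return abs(extrapolate_position(obs, now_h) - target_km) <= tolerance_km


# Last recorded at 8:00, 120 km along the route, moving at 60 km/h.
last_seen = Observation(time_h=8.0, position_km=120.0, speed_kmh=60.0)

estimate = extrapolate_position(last_seen, now_h=9.5)
in_x = probably_at(last_seen, now_h=9.5, target_km=200.0, tolerance_km=15.0)
```

The older the last observation, the less such an estimate can be trusted, which is why even questions about the present call for a simulation interface.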

Gio Wiederhold PDM 28

Integrative information systems: research questions

• What human interfaces can support the decision maker?

• How to move seamlessly from the past to the future?

• What system interfaces are good now and stay adaptable?

• How can multiple futures be managed (indexed)?

• How can multiple futures be compared, selected?

• How should joint uncertainty be computed?

• How can the NOW point be moved automatically?

Gio Wiederhold PDM 29

SimQL research questions

• How little of the model needs to be exposed?

• How can defaults be set rationally?

• How should expected execution cost be reported?

• How should uncertainty be reported?

• Are there differences among application areas that require different language structures?

• Are there differences among application areas that require different language features?

• How will the language interface support effective partitioning and distribution?

Gio Wiederhold PDM 30

Moving to a Service Paradigm

Interfaces define service potentials

• Server is an independent contractor, defines service

• Client selects service, and specifies parameters

• Server’s success depends on value provided

• Some form of payment is due for services

Databases are a current example. Simulations have the same potential.

Gio Wiederhold PDM 31

Summary of SimQL

A new service for Decision Making:

• follows database paradigm
  – (by about 25 years)
• coherence in prediction
  – displacement of ad-hoc practices
• seamless information integration
  – single paradigm for decision makers
• simulation industry infrastructure
  – investment has a potential market
  – should follow database industry model:

Interfaces promote new industries

Gio Wiederhold PDM 32

Summary: Today decision-making support is disjoint; each community improves its area and ignores others

[Figure: separate communities that do not interoperate: Databases, Simulation, Planning Science, Distribution. Extensions for network support are also disjoint.]

Gio Wiederhold PDM 33

The decision maker has few tools

• Spreadsheets
• Planning of allocations
• Other simulations

[Figure: time axis spanning past, now, future. Organized support: databases and data integration (distributed, heterogeneous). Disjointed support: intuition plus various point assessments.]

Gio Wiederhold PDM 34

Coda: Put relevant work together and move on

Support integration of results mined from past data, current observations, and predictions about the futures.

[Figure: "Real Information Systems". Databases, data mining, modeling tools, and simulation support services are connected through service interfaces; human interfaces connect them to the decision maker.]