SUBSTRUCTURE DISCOVERY IN REAL WORLD SPATIO-TEMPORAL DOMAINS

Jesus A. Gonzalez

Supervisor: Dr. Lawrence B. Holder

Committee: Dr. Diane J. Cook, Dr. Lynn Peterson

OUTLINE

Motivation and Goal.

Knowledge Discovery with Subdue.

Application to two Real-World Relational Databases.

Comparison of Subdue with ILP Systems.

Conclusion and Future Work.

MOTIVATION AND GOAL

Need to analyze large amounts of information in real world databases.

Information that standard tools cannot detect.

Aviation Safety Reporting System Database.

Earthquake Database.

Previous knowledge: Spatio-Temporal relations.

THE KDD PROCESS

Figure: stages of the KDD process: data collection, specific domain data, selection, data set, data preparation, clean and prepared data, data transformation, formatted and structured data, data mining (SUBDUE), found patterns, pattern evaluation, knowledge, knowledge application.

SUBDUE KNOWLEDGE DISCOVERY SYSTEM

SUBDUE discovers patterns (substructures) in structural data sets.

SUBDUE represents data as a labeled graph.

Inputs: Vertices and Edges.

Outputs: Discovered patterns and instances.

EXAMPLE

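Figure: a substructure with two "object" vertices, one with a "shape" edge to a "triangle" vertex and the other with a "shape" edge to a "square" vertex, joined by an "on" edge. The input graph contains 4 instances of this substructure.

Vertices: objects or attributes.

Edges: relationships.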
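The example can be written down as a small labeled graph. The sketch below is illustrative only: the `Vertex` and `Edge` containers and both helper functions are hypothetical names, not SUBDUE's input format, and the direction of the "on" edge (triangle object on square object) is an assumption.

```python
from collections import namedtuple

# Hypothetical minimal containers; SUBDUE's real input format is not shown on the slides.
Vertex = namedtuple("Vertex", ["vid", "label"])
Edge = namedtuple("Edge", ["src", "dst", "label"])


def build_example_graph(copies=4):
    """Build `copies` disjoint 'triangle on square' structures as one labeled graph."""
    vertices, edges = [], []
    for _ in range(copies):
        base = len(vertices)
        vertices += [
            Vertex(base, "object"),
            Vertex(base + 1, "triangle"),
            Vertex(base + 2, "object"),
            Vertex(base + 3, "square"),
        ]
        edges += [
            Edge(base, base + 1, "shape"),
            Edge(base + 2, base + 3, "shape"),
            Edge(base, base + 2, "on"),  # assumed direction: triangle object on square object
        ]
    return vertices, edges


def count_on_instances(vertices, edges):
    """Count 'on' edges that run from a triangle-shaped object to a square-shaped object."""
    label = {v.vid: v.label for v in vertices}
    shape = {e.src: label[e.dst] for e in edges if e.label == "shape"}
    return sum(
        1
        for e in edges
        if e.label == "on"
        and shape.get(e.src) == "triangle"
        and shape.get(e.dst) == "square"
    )


vertices, edges = build_example_graph()
print(len(vertices), len(edges), count_on_instances(vertices, edges))  # 16 12 4
```

Attribute values such as "triangle" are vertices in their own right, so a shared attribute shows up as repeated structure that a graph-based search can find.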

SUBDUE’S SEARCH

Starts with a single vertex and expands by one edge at a time.

Computationally constrained beam search.

Search space is all subgraphs of the input graph.

Guided by compression heuristics.
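A much simplified sketch of this kind of search follows. It is not SUBDUE's implementation: instances are grouped by a sorted tuple of labels rather than by a real graph isomorphism test, and the score (edges covered by all instances of a pattern) is a hypothetical stand-in for the compression heuristics. The names `signature` and `beam_search` and the parameters `beam_width` and `max_edges` are assumptions.

```python
from collections import defaultdict


def signature(inst, edges, vlabel):
    """Order-independent label summary of an instance (NOT a real isomorphism test)."""
    return tuple(sorted((vlabel[edges[i][0]], edges[i][2], vlabel[edges[i][1]]) for i in inst))


def beam_search(vlabel, edges, beam_width=3, max_edges=3):
    """Grow substructures one edge at a time, keeping the best `beam_width` patterns per level."""
    incident = defaultdict(set)  # vertex id -> indices of edges touching it
    for i, (u, v, _) in enumerate(edges):
        incident[u].add(i)
        incident[v].add(i)

    # Level 0: every single vertex is an instance with no edges yet.
    frontier = [(frozenset([v]), frozenset()) for v in vlabel]
    best = []  # (score, pattern signature, number of instances)
    for _ in range(max_edges):
        groups = defaultdict(set)  # signature -> set of (vertex set, edge set) instances
        for verts, inst in frontier:
            for v in verts:
                for i in incident[v] - inst:
                    grown = (verts | {edges[i][0], edges[i][1]}, inst | {i})
                    groups[signature(grown[1], edges, vlabel)].add(grown)
        if not groups:
            break
        # Hypothetical score: total edges covered by the pattern's instances.
        ranked = sorted(
            ((len(sig) * len(insts), sig, insts) for sig, insts in groups.items()),
            key=lambda t: (-t[0], t[1]),
        )[:beam_width]
        best.extend((score, sig, len(insts)) for score, sig, insts in ranked)
        frontier = [inst for _, _, insts in ranked for inst in insts]
    return max(best, key=lambda t: t[0]) if best else None


# Four disjoint copies of a small "triangle on square" structure.
vlabel, edges = {}, []
for k in range(4):
    b = 4 * k
    vlabel.update({b: "object", b + 1: "triangle", b + 2: "object", b + 3: "square"})
    edges += [(b, b + 1, "shape"), (b + 2, b + 3, "shape"), (b, b + 2, "on")]

score, pattern, n_instances = beam_search(vlabel, edges)
print(score, n_instances)  # 12 4
print(pattern)
```

The beam keeps memory and time bounded at the cost of completeness: a pattern whose early fragments score poorly can be pruned before it grows into something useful.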

EVALUATION CRITERION

Minimum Encoding.

Graph Compression.

Substructure Size (Tried but did not work).

EVALUATION CRITERION: MINIMUM DESCRIPTION LENGTH

Minimum Description Length (MDL) principle: the best theory to describe a set of data is the one that minimizes the description length (DL) of the entire data set.

DL of the graph: the number of bits necessary to completely describe the graph.

Search for the substructure that results in the maximum compression.
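One common way to write this criterion is given below. The notation is assumed here rather than taken from the slides: DL(G) is the description length of the input graph G, DL(S) that of a candidate substructure S, and DL(G | S) that of G after each instance of S has been replaced by a single vertex.

```latex
\[
S^{*} \;=\; \arg\min_{S}\,\bigl[\, DL(S) + DL(G \mid S) \,\bigr],
\qquad
\mathrm{compression}(S) \;=\; \frac{DL(G)}{DL(S) + DL(G \mid S)} .
\]
```

Because DL(G) is fixed for a given input graph, minimizing the sum and maximizing the ratio select the same substructure.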

THE ASRS DATABASE

The Aviation Safety Reporting System (ASRS).

Reports of incidents that might affect aviation safety.

Some fields modified or omitted to keep the pilot’s identity confidential.

72,504 records, with 74 fields each.

THE ASRS DATABASE KNOWLEDGE REPRESENTATION

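Figure: each record is an Event vertex (Event 1, Event 2, ..., Event m) with labeled edges to its attribute values, for example Acft_type: Small_Transport; Detectors: ATC, Cockpit, Others; Num_engine: 2.000000; Surface: Land_Plane. Events are linked to each other by Near_in_distance edges.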
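A minimal sketch of this representation, assuming each record arrives as a Python dictionary: the function name `record_to_graph` and the tuple layout are hypothetical, and only the field names and values come from the slide.

```python
def record_to_graph(event_id, record):
    """Turn one flat record into an 'Event' vertex with one labeled edge per field value."""
    vertices = [(event_id, "Event")]
    edges = []
    next_id = event_id + 1
    for field, value in record.items():
        # A multi-valued field (e.g. several detectors) gets one edge per value.
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            vertices.append((next_id, str(item)))
            edges.append((event_id, next_id, field))
            next_id += 1
    return vertices, edges, next_id


record = {
    "Acft_type": "Small_Transport",
    "Detectors": ["ATC", "Cockpit", "Others"],
    "Num_engine": "2.000000",
    "Surface": "Land_Plane",
}
vertices, edges, next_id = record_to_graph(0, record)
print(len(vertices), len(edges), next_id)  # 7 6 7
```

Returning `next_id` lets the caller number the next event's vertices without collisions when many records are appended to the same graph.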

THE ASRS DATABASE: PRIOR KNOWLEDGE

Connections between events where related airports are near to each other.

An airport is near another airport if the distance between them is not more than 200 km.

Spatial relations represented with “near_in_distance” edges.
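The 200 km rule can be sketched as follows. The slides do not say how distance was computed, so the great-circle (haversine) formula is an assumption, as are the function names and the three airports "A", "B", and "C" with made-up coordinates. The sketch links airports directly. In the graph described above, the edge would connect the events reported at those airports.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def near_in_distance_edges(airports, max_km=200.0):
    """Return (a, b, 'near_in_distance') for every airport pair at most `max_km` apart."""
    return [
        (a, b, "near_in_distance")
        for a, b in combinations(sorted(airports), 2)
        if haversine_km(*airports[a], *airports[b]) <= max_km
    ]


# Hypothetical airports: name -> (latitude, longitude).
airports = {"A": (32.0, -97.0), "B": (33.0, -97.0), "C": (40.0, -97.0)}
print(near_in_distance_edges(airports))  # [('A', 'B', 'near_in_distance')]
```

Each unordered pair is visited once, which matches the undirected "near_in_distance" edges in the graph.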

THE ASRS DATABASE: RESULTS

Data set: “CONSEQUENCES”: “ACFT_DAMAGED” or “INJURY”. “ACFT_TYPE”: “MED_LARGE_TRANSPORT”.

Graph: 1,053 events, 42,723 vertices, 41,669 directed edges and 18,373 undirected edges. File size: 2,143,356 bytes.

THE ASRS DATABASE RESULTS: MINIMUM ENCODING HEURISTIC

Substructure 1, found with the Minimum Encoding heuristic, with 374 instances.

Figure: two Event vertices joined by a Near_in_distance edge.

First event: Acft_type: Med_Large_Transport; Crew_size: 2.000000; Engine_typ: Turbojet; Flt_plan: IFR; Lndg_gear: Retractable; Mission: Passenger; Num_engine: 2.000000; Operator: Air_Carrier; Report_typ: Occ; Role: Flight_Crew; Surface: Land_Plane; Wings: Low_Wing.

Second event: the same edges and values, without Flt_plan, Mission, and Role.

THE ASRS DATABASE RESULTS: MINIMUM ENCODING HEURISTIC

Substructure 3, found with the Minimum Encoding heuristic, with 286 instances.

Figure: a Sub_1 vertex with the edges Alt_agl_hi: 0.0; Alt_agl_lo: 0.0; Consequenc: Acft_damaged; Flt_condit: VMC; Lighting: Daylight; Fac_type: Airport.

THE ASRS DATABASE RESULTS: MINIMUM ENCODING HEURISTIC

Substructure 4, found with the Minimum Encoding heuristic, with 67 instances.

Figure: a Sub_2 vertex and an Event vertex joined by a Near_in_distance edge.

THE ASRS DATABASE RESULTS: MINIMUM ENCODING HEURISTIC

Subdue was able to geographically relate incidents that occurred near each other and shared the same characteristics.

This information is valuable for investigating similar events in a particular region that might have the same cause.

THE ASRS DATABASE RESULTS: GRAPH COMPRESSION HEURISTIC

Substructure 3: a problem occurring in a region determined by the area where the substructures were found.

Substructure 3 interpretation: two incidents that happened near each other. If airplane identification and complete date and time were available, we might find and trace an airplane that failed near one airport, was reported, and later had to land close to this first airport due to another failure.

THE EARTHQUAKE DATABASE

Several catalogs.

Sources like the National Geophysical Data Center.

Each record with 35 fields describing the earthquake characteristics.

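THE EARTHQUAKE DATABASE: KNOWLEDGE REPRESENTATION

Figure: each record is an Event vertex (Event 1, Event 2, Event 3, ..., Event m) with labeled edges to its attribute values, for example Category: PDE_W; Year: 1998; Month: 01; Magnitude: 4.5. Events are linked to each other by Near_in_distance and Near_in_time edges.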

THE EARTHQUAKE DATABASE: PRIOR KNOWLEDGE

Connections between events whose epicenters were close to each other in distance (<= 75 kilometers).

Connections between events that happened close to each other in time (<= 36 hours).

Spatio-Temporal relations represented with “near_in_distance” and “near_in_time” edges.
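The temporal rule can be sketched as below, assuming each event carries a full timestamp. The function name `near_in_time_edges`, the event identifiers, and the timestamps are hypothetical. The spatial rule would use the same pairwise pattern with a distance function and a 75 km threshold.

```python
from datetime import datetime, timedelta


def near_in_time_edges(events, max_gap=timedelta(hours=36)):
    """Return (a, b, 'near_in_time') for every pair of events at most `max_gap` apart.

    `events` is a list of (event_id, datetime) pairs.
    """
    ordered = sorted(events, key=lambda e: e[1])
    edges = []
    for i, (id_a, t_a) in enumerate(ordered):
        for id_b, t_b in ordered[i + 1:]:
            if t_b - t_a > max_gap:
                break  # every later event is even further away in time
            edges.append((id_a, id_b, "near_in_time"))
    return edges


# Hypothetical events.
events = [
    ("e1", datetime(1998, 1, 1, 0, 0)),
    ("e2", datetime(1998, 1, 2, 12, 0)),  # exactly 36 hours after e1
    ("e3", datetime(1998, 1, 3, 0, 0)),   # 48 hours after e1, 12 hours after e2
    ("e4", datetime(1998, 1, 10, 0, 0)),
]
print(near_in_time_edges(events))
# [('e1', 'e2', 'near_in_time'), ('e2', 'e3', 'near_in_time')]
```

Sorting first lets the inner loop stop at the first event outside the window, which matters when tens of thousands of events are compared.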

THE EARTHQUAKE DATABASE: RESULTS

Sample of the events that happened in one year.

All the fields in the records were considered.

Graph: 10,135 events, 136,077 vertices, 125,941 directed edges and 757,417 undirected edges. Graph file size: 26,963,605 bytes.

THE EARTHQUAKE DB RESULTS: GRAPH COMPRESSION HEURISTIC

Substructure 8, found with the Graph Compression heuristic, with 140 instances.

Figure: Sub-1 and Sub-7 vertices joined by a Near_in_time edge, with a Depth edge to the value 33.0000.

THE EARTHQUAKE DB RESULTS

Graph Compression runs faster, which allows more iterations.

Given enough time, MDL could find those substructures.

MDL finds substructures using Spatio-Temporal relations.

Subdue found relations with fields like “Catalog”, “Month”, “Mag1 Scale”, and “Depth”.

More earthquakes happened in the months of May and June.

Most frequent earthquake depths were 33 and 10 kilometers.

DETERMINING EARTHQUAKE ACTIVITY

Geologist Dr. Burke Burkart. Study of seismic activity caused by the Orizaba Fault.

Fault: a fracture in a surface along which rocks have also been displaced.

Selection of the area of study, two rectangular regions:

First: longitude 94.0W through 101.0W and latitude 17.0N through 18.0N.

Second: longitude 94.0W through 98.0W and latitude 18.0N through 19.0N.

DETERMINING EARTHQUAKE ACTIVITY

Divide the area into 44 rectangles, each one half of a degree in both longitude and latitude.

Sample the earthquake activity in each sub-area.

Run Subdue in each sub-area.
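The subdivision can be reproduced with a short sketch. The function name `half_degree_cells` is hypothetical, and the numbering order (west to east, south before north within each column, first region before second) is inferred from the area table rather than stated on the slides.

```python
def half_degree_cells(lon_west, lon_east, lat_south, lat_north, step=0.5):
    """Yield (lon_from, lon_to, lat_from, lat_to) cells, west to east, south before north.

    Longitudes are in degrees west, as on the slides, so they decrease toward the east.
    """
    n_cols = round((lon_west - lon_east) / step)
    n_rows = round((lat_north - lat_south) / step)
    for c in range(n_cols):
        for r in range(n_rows):
            yield (
                lon_west - c * step,
                lon_west - (c + 1) * step,
                lat_south + r * step,
                lat_south + (r + 1) * step,
            )


# First region: 101.0W to 94.0W, 17.0N to 18.0N. Second region: 98.0W to 94.0W, 18.0N to 19.0N.
cells = list(half_degree_cells(101.0, 94.0, 17.0, 18.0))
cells += list(half_degree_cells(98.0, 94.0, 18.0, 19.0))
numbered = {i + 1: cell for i, cell in enumerate(cells)}

print(len(numbered))  # 44
print(numbered[26])   # (95.0, 94.5, 17.5, 18.0)
```

The first region contributes 14 x 2 = 28 cells and the second 8 x 2 = 16, which gives the 44 sub-areas.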

DETERMINING EARTHQUAKE ACTIVITY

Area     Longitude         Latitude        Area       Number of
Number   (area coordinates)                Name       Events

1        101.0W 100.5W     17.0N 17.5N     Gue1       62
2        101.0W 100.5W     17.5N 18.0N     Gue2       40
3        100.5W 100.0W     17.0N 17.5N     Gue3       57
4        100.5W 100.0W     17.5N 18.0N     Gue4       13
5        100.0W  99.5W     17.0N 17.5N     Gue5       71
6        100.0W  99.5W     17.5N 18.0N     Gue6       15
7         99.5W  99.0W     17.0N 17.5N     Gue7       35
8         99.5W  99.0W     17.5N 18.0N     Gue8       16
9         99.0W  98.5W     17.0N 17.5N     Gue9       13
10        99.0W  98.5W     17.5N 18.0N     Gue10      14
...
26        95.0W  94.5W     17.5N 18.0N     Ver1       43
27        94.5W  94.0W     17.0N 17.5N     Oaxver4    35
28        94.5W  94.0W     17.5N 18.0N     Ver2       23
29        98.0W  97.5W     18.0N 18.5N     Pue1       6
30        98.0W  97.5W     18.5N 19.0N     Pue2       0
...
42        95.0W  94.5W     18.5N 19.0N     Vergolf5   1
43        94.5W  94.0W     18.0N 18.5N     Vergolf4   3
44        94.5W  94.0W     18.5N 19.0N     Vergolf6   1

DETERMINING EARTHQUAKE ACTIVITY

Figure, Substructure 1 (19 instances): two Event vertices joined by a Near_in_distance edge, each with the edges Category: PDE and Region_number: 61.00.

Figure, Substructure 2 (8 instances): a Sub_1 vertex with the edges Depth: 33.00; Dept_ctl: N; Coord_qual..: %.

Substructure 1 (with 19 instances) and substructure 2 (with 8 instances) found in sub-area 26.

DETERMINING EARTHQUAKE ACTIVITY

This pattern might give us information about the cause of the earthquakes.

Subduction also affects this area, but at a specific depth that depends on the closeness to the Pacific Ocean.

SUBDUE’S POTENTIAL

Subdue finds not only shared characteristics of events, but also spatial relations between them.

Dr. Burke Burkart is studying the patterns to give direction to this research.

Expect to find patterns representing parts of the paths of the involved fault.

Time relations not considered by Subdue.

Earthquake’s characteristics.

Important for other areas.

COMPARISON OF SUBDUE WITH ILP SYSTEMS

Inductive Logic Programming (ILP) systems learn logical relations.

FOIL, GOLEM, PROGOL.

SUBDUE competitive in several domains.

Table 7. Number of Rules Used and Average of Errors Made by System per Domain

DOMAIN     FOIL         GOLEM         SUBDUE
Vote       8 / 3.0      9 / 4.3       1 / 9.3
Credit     83 / 33.5    234 / 48.5    1 / 51.2
Diabetes   21 / 30.8    113 / 39.4    1 / 30.6

CONCEPT LEARNING SUBDUE

ILP systems take positive and negative examples represented with First Order Logic.

New Concept Learning Subdue (CLSubdue) does too.

Can learn multiple rules.

Evaluation is ongoing.

CONCLUSION

Subdue successful in real world databases.

Subdue discovered interesting patterns using the temporal and spatial relations.

Subdue found significant patterns in the Orizaba Fault Earthquake Database.

Subdue has potential to compete with ILP systems.

Subdue compared with Progol.

FUTURE WORK

Theoretical analysis.

Show Subdue converges to optimal substructure.

Better understanding of search space properties.

Bounds on complexity (e.g. PAC learning).

Graphical User Interface to visualize substructures and their instances.

Express ranges of values (ranges of depth, magnitude, latitude, longitude, etc. in the Earthquake database).

Continue Evaluation in Real-World Spatio-Temporal Databases.