Loretta Auvil Automated Learning Group National Center for Supercomputing Applications University of Illinois 217.265.8021 lauvil@ncsa.uiuc.edu D2K – Data To Knowledge March 19, 2004 Duke University


Page 1

Loretta Auvil

Automated Learning Group
National Center for Supercomputing Applications
University of Illinois
217.265.8021
lauvil@ncsa.uiuc.edu

D2K – Data To Knowledge

March 19, 2004
Duke University

Page 2

alg | Automated Learning Group

Outline

• Overview of Data Mining

• Overview of D2K Functionality

• D2K Toolkit
  • MAIDS – Mining Streaming Data

• D2K Driven Application
  • ThemeWeaver – Mining Text Data
  • MAEViz – Visualizing Earthquake Damage Analysis

• D2K Streamline (SL)
  • EMO – Finding Optimal Decisions

• D2K Web Service
  • Phylomat – Finding Motifs in Sequences

Page 3

ALG Mission

The specific mission of the Automated Learning Group is:

 

• To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making

• To work closely with industrial, government, and academic partners to explore new application areas for such methods, and

 

• To transfer the resulting software technology into real world applications

Page 4

ALG Research, Development, & Technology Transfer Model

Page 5

What is It?

Knowledge Discovery in Databases is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

• The understandable patterns are used to:
  • Make predictions about or classifications of new data
  • Explain existing data
  • Summarize the contents of a large database to support decision making
  • Create graphical data visualizations to aid humans in discovering complex patterns

Overview of Knowledge Discovery

Page 6

Why Do We Need Data Mining?

• Data volumes are too large for classical analysis approaches:
  • Large number of records (10^8 – 10^12 bytes)
  • High dimensional data (10^2 – 10^4 attributes)

How do you explore millions of records, tens or hundreds or thousands of fields, and find patterns?

• As databases grow, the ability to use traditional query languages for the decision support process becomes infeasible

• Many queries of interest are difficult to state in a query language (query formulation problem)
  • “Find all cases of fraud”
  • “Find all individuals likely to buy a Ford Explorer”
  • “Find all documents that are similar to this customer’s problem”

Overview of Knowledge Discovery

Page 7

Knowledge Discovery Process

Overview of Knowledge Discovery

Page 8

Required Effort for each KDD Step

[Figure: bar chart of Effort (%), on a scale of 0 to 60, for each KDD step: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation. Arrows indicate the direction we want the effort to go.]

Overview of Knowledge Discovery

Page 9

Three Primary Paradigms

• Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired
  • Classification is the prediction of predefined classes
    – Naive Bayesian, Decision Trees, and Neural Networks
  • Regression is the prediction of continuous data
    – Neural Networks and Decision (Regression) Trees

• Discovery – unsupervised learning approach for exploratory data analysis
  • Association Rules and Link Analysis
  • Clustering and Self Organizing Maps

• Deviation Detection – identifying outliers in the data
  • Visualization
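
To make the classification paradigm concrete, here is a minimal sketch of a Naive Bayesian classifier over categorical attributes. It is an illustration only and not D2K code: the function names, the toy weather-style data, and the use of Laplace (add-one) smoothing are all assumptions introduced for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-attribute value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(attribute index, class)][value] = number of occurrences
    value_counts = defaultdict(Counter)
    # distinct values seen per attribute, needed for Laplace smoothing
    vocab = defaultdict(set)
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def classify(model, row):
    """Return the class with the highest log posterior for one record."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            count = value_counts[(i, label)][value]
            # add-one smoothing so unseen values do not zero out the product
            score += math.log((count + 1) / (n + len(vocab[i])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (hypothetical): (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"),
        ("rainy", "cool"), ("overcast", "hot")]
labels = ["no", "no", "yes", "yes", "yes"]
model = train_naive_bayes(rows, labels)
print(classify(model, ("rainy", "cool")))
```

The same train-then-apply split mirrors the "building and applying models" idea: the model is built once from historical records and then applied to new ones.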

Overview of Knowledge Discovery

Page 10

Importance of Data Mining Framework

• Provides capability to build custom applications

• Provides access to data management tools

• Contains data mining algorithms for prediction and discovery

• Provides data transformations for standard operations

• Supports an extensible interface for creating one’s own algorithms

• Provides means for building and applying models

• Provides integrated visualization components

• Provides access to distributed computing capabilities

Page 11

D2K - Data To Knowledge

D2K is a flexible data mining system that integrates effective analytical data mining methods for prediction, discovery, and anomaly detection with data management and information visualization

D2K Overview

Page 12

D2K and Its Many Components

• D2K Infrastructure
  D2K API, data flow environment, distributed computing framework and runtime system

• D2K Modules
  Computational units written in Java that follow the D2K API

• D2K Itineraries
  Modules that are connected to form an application

• D2K Toolkit
  User interface for specification of itineraries and execution that provides the rapid application development environment

• D2K-Driven Applications
  Applications that use D2K modules with a custom user interface

• D2K Streamline (SL)
  Task driven system that uses D2K modules

• D2K Web/Grid Services
  Enables web deployment

D2K Overview

Page 13

D2K Toolkit

Major features that D2K provides to an application developer include:

• Visual programming system employing a data flow paradigm

• Scalable distributed computing capabilities

• Flexible and extensible software development environment

• Multi-layered learning strategies

• Integrated environment for models and visualization

• Web service capabilities for deployment

D2K Overview

Page 14

D2K Modules

Input Module: Loads data from the outside world
  • Flat files, databases, etc.

Data Prep Module: Performs functions to select, clean, or transform the data
  • Binning, Normalizing, Feature Selection, etc.

Compute Module: Performs main algorithmic computations
  • Naïve Bayesian, Decision Tree, Apriori, etc.

User Input Module: Requires interaction with the user
  • Data Selection, Input and Output selection, etc.

Output Module: Saves data to the outside world
  • Flat files, databases, etc.

Visualization Module: Provides visual feedback to the user
  • Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
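
The way module types chain together in a data flow can be sketched in a few lines. This is a hedged illustration in Python, not the D2K API (real D2K modules are Java classes that follow that API, which the slides do not show). Every function name below is hypothetical, and a linear chain stands in for what would generally be a graph of connected ports.

```python
# Hypothetical stand-ins for D2K module types; the real modules are Java
# classes that follow the D2K API, which is not reproduced here.

def input_module(_):
    """Input: load records (hard-coded here instead of a flat file)."""
    return [4.0, 8.0, None, 6.0, 2.0]

def data_prep_module(values):
    """Data prep: drop missing values, then min-max normalize to [0, 1]."""
    clean = [v for v in values if v is not None]
    lo, hi = min(clean), max(clean)
    return [(v - lo) / (hi - lo) for v in clean]

def compute_module(values):
    """Compute: a trivial 'algorithm' that averages the prepared values."""
    return sum(values) / len(values)

def output_module(result):
    """Output: format the result for the outside world."""
    return f"mean of normalized values = {result:.2f}"

def run_itinerary(modules, data=None):
    """Pass each module's output to the next module's input, in order."""
    for module in modules:
        data = module(data)
    return data

# An "itinerary" here is just an ordered list of connected modules.
itinerary = [input_module, data_prep_module, compute_module, output_module]
report = run_itinerary(itinerary)
print(report)
```

Swapping one function for another (for example, a different data prep step) changes the application without touching the other modules, which is the main appeal of the data flow style.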

D2K Overview

Page 15

D2K Module Icon Description

Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.

Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.

Properties Symbol: If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.

Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.

D2K Overview

Page 16

MAIDS: Mining Alarming Incidents in Data Streams

Stream Characteristics

• Huge volumes of continuous data, possibly infinite

• Fast changing, requiring fast, real-time response

• The data stream model captures many of today's data processing needs

• Random access is expensive, so algorithms make a single linear scan (each item is seen only once)

• Only a summary of the data seen thus far is stored

• Most stream data are low-level or multi-dimensional in nature and need multi-level, multi-dimensional processing
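
The single-scan, summary-only constraints can be illustrated with a small sketch. The slides do not describe the algorithms MAIDS actually uses, so this example substitutes a well-known technique, Welford's running mean and variance, and flags values that deviate strongly from the summary. The class name, the threshold of 3 standard deviations, and the warm-up length are assumptions chosen for illustration.

```python
import math

class StreamSummary:
    """Constant-size summary of a stream: count, mean, and sum of squared deviations."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, x):
        """Welford's update: fold one value into the summary without storing it."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        """Sample standard deviation of the values seen so far."""
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

def alarming(stream, threshold=3.0, warmup=5):
    """Yield values far from the running mean; each item is examined exactly once."""
    summary = StreamSummary()
    for x in stream:
        if summary.n >= warmup and summary.std() > 0:
            if abs(x - summary.mean) > threshold * summary.std():
                yield x
        summary.update(x)

readings = [10, 11, 9, 10, 12, 10, 11, 50, 10, 9]
alerts = list(alarming(readings))
print(alerts)
```

Memory use stays fixed no matter how long the stream runs, because only three numbers are kept.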

Current ALG Projects

Page 17

MAIDS

Using D2K Toolkit

Page 18

Text Mining

• Information Retrieval
  • Indexing and retrieval of textual documents
  • Finding a set of (ranked) documents that are relevant to the query

• Information Extraction
  • Extraction of partial knowledge in the text

• Web Mining
  • Indexing and retrieval of textual documents and extraction of partial knowledge using the web

• Classification
  • Predict a class for each text document

• Clustering
  • Generating collections of similar text documents
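
As a rough illustration of the retrieval task (finding ranked documents relevant to a query), the sketch below scores documents by cosine similarity between raw term-count vectors. This is a generic textbook approach chosen for brevity, not a description of how T2K or ThemeWeaver work. The function names and sample documents are made up for the example.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term counts from whitespace-separated, lowercased tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, documents):
    """Return documents with nonzero similarity to the query, best match first."""
    q = vectorize(query)
    scored = [(cosine(q, vectorize(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored if score > 0]

docs = [
    "printer will not print after driver update",
    "earthquake damage to bridge inventory",
    "driver update broke the scanner",
]
results = rank("printer driver problem", docs)
print(results)
```

A production system would normally add stemming, stop-word removal, and inverse document frequency weighting, all of which are omitted here.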

Current ALG Projects

Page 19

Text Mining: Views from T2K and ThemeWeaver

Using D2K Driven Application

Page 20

MAEViz: Damage Synthesis Visualization

Using D2K Driven Application

• Displays terrain map

• Loads hazard, inventory, and fragility data

• Shows contour map of ground acceleration (hazard)

• Displays cones/bars to indicate level of damage

• Overlays shapefiles of different information

• Uses VTK for 3D

• Uses CUBE at BI

Page 21

D2K Streamline (D2K SL)

• Provides step by step interface to guide user in data analysis

• Supports return to earlier steps to run with different parameters

• Uses the D2K infrastructure transparently

• Uses same D2K modules

• Provides way to capture different experiments

D2K SL

Page 22

EMO – Evolutionary Multiobjective Optimization

Using D2K SL

• Identify tradeoffs among complex objectives

• Apply a genetic algorithm (GA) optimization in a general framework

• Guide the user through discrete steps for defining decision variables, fitness functions, and constraints, and for setting up GA parameters
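
What "tradeoffs among complex objectives" means can be shown with a minimal Pareto-front filter. This sketch covers only the notion of non-dominated solutions that multiobjective optimization searches for. It does not implement the genetic algorithm EMO applies, and the objective values (cost, risk), both assumed to be minimized, are invented for the example.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.

    All objectives are assumed to be minimized.
    """
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical candidate decisions scored as (cost, risk).
candidates = [(10, 5), (8, 7), (12, 4), (9, 9), (11, 6)]
front = pareto_front(candidates)
print(front)
```

Each point left on the front is a distinct tradeoff: lowering its cost further is only possible by accepting more risk, and vice versa.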

Page 23

D2K Web Service Architecture

• Any web enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.

• Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers.

• Job results are also stored in the web service tier.
  • Results are returned to clients upon request.

• A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.

• Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
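
For readers unfamiliar with SOAP, the sketch below builds the kind of XML envelope a client would send in an HTTP POST body. Only the SOAP 1.1 envelope namespace is standard. The service namespace, the operation name `runItinerary`, and its parameters are hypothetical placeholders, since the slides do not document the D2K Web Service interface. No network request is made.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:d2k-service"  # hypothetical namespace, not from the slides

def build_request(operation, params):
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# Hypothetical operation and parameter names, for illustration only.
message = build_request("runItinerary", {"itinerary": "example.itn", "account": "demo"})
print(message)
```

A client would send this string as the body of an HTTP POST to the service endpoint and later issue a similar request to fetch the stored job results.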

Page 24

Phylomat (Motif Analysis Tool for Phylogenomics)

Using D2K Web Service

Page 25

The ALG Team

Staff: Loretta Auvil, Peter Bajcsy, Colleen Bushell, Dora Cai, David Clutter, Lisa Gatzke, Vered Goren, Chris Navarro, Greg Pape, Tom Redman, Duane Searsmith, Andrew Shirk, Anca Suvaiala, David Tcheng, Michael Welge

Students: Ritesh Agrawal, Tyler Alumbaugh, John Cassel, Sang-Chul Lee, Xiaolei Li, Jeff Ng, Scott Ramon, Martin Urban, Bei Yu, Hwanjo Yu

Page 26

Licensing D2K

• Faculty, staff and students at US academic institutions will be able to license and use D2K for free by downloading from alg.ncsa.uiuc.edu

• Private Sector Partners who have provided funding for projects related to D2K will be able to license and use D2K for free

• Private Sector Partners who have not provided funding will be able to license and use D2K for a discounted fee

Contact:
John McEntire
Office of Technology Management
308 Ceramics Building, MC-243
105 South Goodwin Avenue
Urbana, Illinois 61801-2901
(217) [email protected]