From D2K to SEASR Overview September 27, 2007 Loretta Auvil Automated Learning Group, NCSA

From D2K to SEASR Overview

September 27, 2007

Loretta Auvil, Automated Learning Group, NCSA

University of Illinois, Urbana-Champaign

ALG Mission

The specific mission of the Automated Learning Group is:
• To collaborate with researchers to develop novel computer methods and the scientific foundation for using historical data to improve future decision making
• To work closely with industrial, government, and academic partners to explore new application areas for such methods, and
• To transfer the resulting software technology into real world applications

Knowledge Discovery Process

Required Effort for each KDD Step

[Bar chart: Effort (%), on a scale from 0 to 60, for each KDD step: Objectives Determination, Data Preparation, Data Mining, Interpretation/Evaluation.]

Arrows indicate the direction we want the effort to go.

Three Primary Paradigms

• Predictive Modeling – supervised learning approach where classification or prediction of one of the attributes is desired.
  – Classification is the prediction of predefined classes
    • e.g. Naive Bayesian, Decision Trees, and Neural Networks
  – Regression is the prediction of continuous data
    • e.g. Neural Networks, and Decision (Regression) Trees
• Discovery – unsupervised learning approach for exploratory data analysis.
  – e.g. Association Rules, Link Analysis, Clustering, and Self Organizing Maps
• Deviation Detection – identifying outliers in the data.
  – e.g. Visualization
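
As a rough illustration of how the three paradigms differ, the sketch below runs each on a made-up one-dimensional data set. The techniques are deliberately simple stand-ins (a nearest-centroid classifier, a gap-based grouping, and a z-score test) chosen for brevity; they are assumptions for this example, not D2K modules or the algorithms named on the slide.

```python
from statistics import mean, pstdev

# Hypothetical toy data: (value, label) pairs. Labels are used only in the supervised case.
labeled = [(1.0, "low"), (1.2, "low"), (0.9, "low"), (5.0, "high"), (5.3, "high"), (4.8, "high")]
values = [v for v, _ in labeled]


def train_nearest_centroid(rows):
    """Predictive modeling: learn one centroid per predefined class."""
    by_class = {}
    for value, label in rows:
        by_class.setdefault(label, []).append(value)
    return {label: mean(vals) for label, vals in by_class.items()}


def classify(centroids, value):
    """Predict the class whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def group_by_gap(data, gap):
    """Discovery: group sorted values, starting a new group at each large gap (no labels used)."""
    groups = []
    for value in sorted(data):
        if groups and value - groups[-1][-1] <= gap:
            groups[-1].append(value)
        else:
            groups.append([value])
    return groups


def outliers(data, threshold):
    """Deviation detection: flag values far from the mean, measured in standard deviations."""
    mu, sigma = mean(data), pstdev(data)
    return [v for v in data if sigma > 0 and abs(v - mu) / sigma > threshold]


centroids = train_nearest_centroid(labeled)
print(classify(centroids, 4.5))
print(group_by_gap(values, gap=1.0))
print(outliers(values + [40.0], threshold=2.0))
```

The supervised step needs the labels to train; the other two work on the raw values alone, which is the practical distinction the slide draws.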

D2K - Framework for Data Analysis

• Provides scalable environment from the Desktop to Web Services
• Employs a visual programming system for data/work flow paradigm
• Provides capability to build custom applications
• Provides capability to access data management tools
• Contains data mining algorithms for prediction and discovery
• Provides data transformations for standard operations
• Integrated environment for models and visualization
• Supports an extensible interface for creating one’s own algorithms
• Provides access to distributed computing capabilities

D2K Components

• D2K Infrastructure
  – Itinerary execution engine
• D2K-Driven Applications
  – Applications that make use of the D2K Infrastructure
  – Toolkit is a D2K-Driven app
• D2K Server
  – Special kind of D2K-Driven app
  – Wraps the infrastructure to provide remote itinerary and module execution
  – Used by the Toolkit to distribute module execution
• D2K Web Service
  – Provides a generic programmatic interface for executing itineraries
  – Communicates with D2K Servers over socket connections using D2K-specific protocols

D2K Streamline (D2K SL)

• Provides step by step interface to guide user in data analysis

• Supports return to earlier steps to run different parameters

• Uses the D2K infrastructure transparently

• Uses same D2K modules

• Provides way to capture different experiments

• Define templates that can be reused in different experiments

D2K Web Service Architecture

• Any web enabled client can connect to and use the D2K Web Service by sending SOAP messages over HTTP.

• Itineraries and modules are stored on the web service machine and loaded over the network by the D2K Servers.

• Job results are also stored in the web service tier.

– Results are returned to clients upon request.

• A relational database is used by the web service to look up accounts, itineraries, servers, and jobs.

• Remote D2K Servers handle itinerary processing. If possible, modules should load any data from remote locations.
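
To make "sending SOAP messages over HTTP" concrete, here is a minimal sketch of what a client-side request could look like. Only the SOAP 1.1 envelope namespace is standard; the endpoint URL, the `runItinerary` operation, the `urn:example:d2k` namespace, and the parameter names are all hypothetical placeholders, since the slides do not document the actual D2K Web Service interface. The request is built but never sent.

```python
import urllib.request
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # standard SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:d2k"  # hypothetical namespace, not taken from D2K documentation


def build_envelope(operation, params):
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope (returned as bytes)."""
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


def build_request(url, operation, params):
    """Prepare (but do not send) the HTTP POST that would carry the envelope."""
    return urllib.request.Request(
        url,
        data=build_envelope(operation, params),
        headers={
            "Content-Type": "text/xml; charset=utf-8",
            "SOAPAction": f'"{SERVICE_NS}#{operation}"',
        },
        method="POST",
    )


# Hypothetical endpoint, operation, and parameter names, for illustration only.
request = build_request(
    "http://localhost:8080/d2k-ws",
    "runItinerary",
    {"account": "demo", "itinerary": "naive-bayes.itn"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Because the contract is plain XML over HTTP, any web-enabled client can take part, which is the point of the first bullet above.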

Creating Customer Value

• Prediction – Industrial manufacturer
  – Computed customer buying propensities
  – Achieved 25% conquest customer sales lift by executing directed cross/upsell, resulting in $65 million in incremental revenue
• Discovery – Automotive manufacturer
  – Identified patterns of inappropriate warranty work in dealer channel
  – Targeted $200M+ of potentially unnecessary annual expense
• Monitoring – Department store retailer
  – Watched POS transaction flow for unusual variations
  – Deterred inappropriate behavior and fraudulent transactions
  – Resulted in savings of over $125 million

Applications Examples

Comparative Genomics

Harris A. Lewin explains that Evolution Highway allows one to look “. . . at the whole genome at once - multiple chromosomes across multiple species. The insights wouldn't have come so quickly if we couldn't throw the data at this framework from NCSA.”

Science, Vol. 309, Issue 5734, Pages 613-617, 22 July 2005

Astronomy

Nicholas M. Ball, Robert J. Brunner, Adam D. Myers, and David Tcheng, Robust Machine Learning Applied to Astronomical Data Sets. I. Star-Galaxy Classification of the Sloan Digital Sky Survey DR3 Using Decision Trees, The Astrophysical Journal, Vol. 650, Part 1, Pages 497–509, 2006

Music Analysis

J. Stephen Downie, The Scientific Evaluation of Music Information Retrieval Systems: Foundations and Future, Computer Music Journal, Vol. 28, No. 2, Pages 12-23, Summer 2004

D2K - Lineage

[Timeline figure spanning 1996 to 2009. The labels recoverable from the figure are grouped below; their positions on the timeline are not recoverable.]

• NCSA: D2K / Data to Knowledge, T2K / ThemeWeaver, I2K / Image to Knowledge, M2K / Music to Knowledge, MAIDS / Mining Alarming Incidents from Data Streams, D2K Streamline, Full Multi-language Interface
• RiverGlass, Inc.: RiverGlass Detect™, RiverGlass Recon™
• One Llama: One Llama Media
• Capabilities: Data Mining, Text Mining, Image Mining, Audio Mining, Stream Mining, Web Acquire, Inference Eng., Fed. Query, Visualization
• Future Research, Technology, Applications: Sensors/RFID, Multimedia, Motion Mining, GeoSpatial, Music Analysis
• Engagements (several repeated in the figure): F100 Insurance, F100 EquipMfg, F100 CommMfg, F100 Retailer, F100 AutoMfg, F100 AircraftMfg, F100 Oil Co, F100 AgResearch, F100 EmergPlan, F500 Insurance, State Agcy, Fedl Agcy, Fedl SI, Higher Educ, GovTech, LawEnforcement, EmergMgmt

D2K ToolKit

1. Workspace
2. Resource Panel
3. Modules
4. Models
5. Itineraries
6. Visualizations
7. Generated Visualizations
8. Generated Models
9. Component Information
10. Toolbar
11. Console

D2K Basic

• Set of D2K Modules to perform data mining techniques
  – Prediction
    • Decision Trees
      – C4.5 Decision Tree, Continuous Decision Tree, SQL Rain Forest Decision Tree
    • Naïve Bayesian Classification and SQL Naïve Bayesian Classification
    • Neural Networks
  – Discovery
    • Rule Association
      – Apriori, FP Growth, Htree
    • Clustering
      – Hierarchical Agglomerative, Kmeans, Coverage, etc.
• Includes visualizations for many of the modeling approaches
• Includes a set of data transformations
  – Attribute selection, binning, filtering, attribute construction
• Includes optimization strategy for searching parameter space
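
For readers unfamiliar with rule association, the following is a compact textbook-style sketch of Apriori frequent-itemset mining, the first step before rules are derived. It is not the D2K module's implementation, and the market-basket data and the `min_support` value are made up for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset found in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count how many transactions contain each candidate, keeping only the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    frequent = count({frozenset([item]) for item in items})
    result = dict(frequent)
    k = 2
    while frequent:
        # Join step: union pairs of frequent (k-1)-itemsets that yield a k-itemset.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result


# Hypothetical market-basket data for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
found = apriori(baskets, min_support=3)
for itemset, support in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The prune step is what keeps the search tractable: a larger itemset is only counted when all of its smaller subsets have already met the support threshold.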

D2K Modules

Input Module: Loads data from the outside world.
– Flat files, database, etc.

Data Prep Module: Performs functions to select, clean, or transform the data.
– Binning, Normalizing, Feature Selection, etc.

Compute Module: Performs main algorithmic computations.
– Naïve Bayesian, Decision Tree, Apriori, FP Growth, etc.

User Input Module: Requires interaction with the user.
– Data Selection, Input and Output selection, etc.

Output Module: Saves data to the outside world.
– Flat files, databases, etc.

Visualization Module: Provides visual feedback to the user.
– Naïve Bayesian, Rule Association, Decision Tree, Parallel Coordinates, 2D Scatterplot, 3D Surface Plot
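
The module categories above can be pictured as stages in a dataflow, where each module's output feeds the next module's input. The sketch below chains four of them in Python. Every function name and the sample data are hypothetical stand-ins for illustration; D2K's real module interface is not described in these slides and is not reproduced here.

```python
import csv
import io

# Hypothetical stand-ins for the module categories; these are not D2K classes.


def input_module(text):
    """Input: load rows from a flat file (here, CSV text held in memory)."""
    return list(csv.DictReader(io.StringIO(text)))


def data_prep_module(rows, column, edges):
    """Data prep: bin a numeric column into intervals delimited by the sorted edges."""
    for row in rows:
        value = float(row[column])
        row[column + "_bin"] = sum(value >= edge for edge in edges)
    return rows


def compute_module(rows, attribute, target):
    """Compute: count target values per attribute value (the tallies a Naive Bayes model starts from)."""
    counts = {}
    for row in rows:
        key = (row[attribute], row[target])
        counts[key] = counts.get(key, 0) + 1
    return counts


def output_module(counts):
    """Output: serialize the result as CSV text for the outside world."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["bin", "class", "count"])
    for (bin_id, label), n in sorted(counts.items()):
        writer.writerow([bin_id, label, n])
    return buffer.getvalue()


def run_itinerary(text):
    """Chain the modules so each output feeds the next input."""
    rows = input_module(text)
    rows = data_prep_module(rows, "age", edges=[30, 50])
    counts = compute_module(rows, "age_bin", "buys")
    return output_module(counts)


sample = "age,buys\n25,no\n35,yes\n45,yes\n62,no\n28,yes\n"
print(run_itinerary(sample))
```

Keeping each stage behind a narrow input/output contract is what lets a visual tool rewire stages without changing their internals.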

D2K Module Icon Description

Module Progress Bar: Appears during execution to show the percentage of time that this module executed over the entire execution time. It is green when the module is executing and red when not.

Input Port: Rectangular shapes on the left side of the module represent the inputs for the module. They are colored according to the data type that they represent.

Properties Symbol: If a “P” is shown in the lower left corner of the module, then the module has properties that can be set before execution.

Output Port: Rectangular shapes on the right side of the module represent the outputs for the module. They are colored according to the data type that they represent.

D2K Demo

SEASR: Research, Development, & Technology Transfer Model

SEASR: The Data Problem
Structured vs. Unstructured

[Timeline figure: volume of information in gigabytes, through 1999, with milestones: Cave paintings, bone tools 40,000 BCE; Writing 3500 BCE; 0 C.E.; Paper 105; Printing 1450; Electricity, Telephone 1870; Transistor 1947; Computing 1950; Internet (DARPA) late 1960s; The Web 1993.]

• 20% structured data, 80% unstructured data
• “Today, 80% of business is conducted on unstructured information” – Gartner Group
• “80% of the information needed is in the Open Source” – NIA
• “Workers spend 80% of the time gathering information” – STIC, EMF

www.fastsearch.com

SEASR

Software Environment for the Advancement of Scholarly Research (SEASR)

– addresses the challenges of transforming information into knowledge by constructing the software bridges that are required to move from the unstructured and semi-structured data world to the structured data world.

– aims to make collections more useful by integrating two well-known research and development frameworks, NCSA’s Data-To-Knowledge (D2K) and IBM’s Unstructured Information Management Architecture (UIMA), into an easily usable environment that researchers in any discipline can learn and adapt for their own unstructured data analysis.

SEASR: Architecture

SEASR’s advanced informatics tools will expand the technical capabilities of what is now available in the field by:

• connecting data sources that are currently incompatible, whether due to different formats or protocols

• offering all project components as open source, to enable users to modify and add to tools

• allowing users to write analytic engines in their programming language of choice

• installing on all hardware footprints, so that the tools can be brought to data sets where they are housed

• creating a repository for components that will support sharing and publishing among users

• enabling scalability so that components may run on a large variety of hardware footprints, including shared memory processors and clusters

SEASR Applications

NoraVis OpenLaszlo

FeatureLens

M2K

SEASR

DISCUS

NoraVis OpenLaszlo

FeatureLens: n-gram patterns

Created by Anthony Don at http://www.cs.umd.edu/hcil/textvis/featurelens/.

Getting the “Band” Together

• June 2007 – Band formation
  – Project start date
  – More use ideas and framework discussions
• December – First “gig”
  – Framework and data app demonstration
• Vocals – Research Technology
  – John Unsworth, Stephen Downie, Tim Wentling
  – Dan Roth, Jiawei Han, Kevin Chang, Cheng Xiang Zhai
• Percussions & Bass – SEASR Development
  – Loretta Auvil, Tara Bazler, Duane Searsmith, Andrew Shirk, Students
• Lead – Designers/Developer/Applications Areas
  – Humanities – M2K, Nora/Monk and Others (we heard about yesterday/today)
• Need Groupies! (Advisors, Researchers, Developers, and Application Drivers)
  – Loretta Auvil

SEASR: How can I participate?

• Collaborate on application development or ontology creation

• Contribute to component development for analytics or data access

• Participate in visualization and UI design

• Serve as an advisor

Contact Loretta Auvil ([email protected])

SEASR: Engineering Knowledge for the Humanities

Thank You