Data Mining
Research and Applications
Workshop on Cyberinfrastructure for Environmental Research and Education
October 31, 2002
Steve Tanner
Information Technology and Systems Center
University of Alabama in Huntsville
[email protected]
256.824.5143
www.itsc.uah.edu
Key Questions:
What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?
Data Mining
Data Mining is an interdisciplinary field drawing from areas such as statistics, machine learning, pattern recognition and others
Automated discovery of patterns, anomalies, etc. from vast observational and model data sets
Derived knowledge for decision making, predictions and disaster response
ADaM – Algorithm Development and Mining System
datamining.itsc.uah.edu
Techniques used for Data Mining
Clustering Techniques
– K Means
– Isodata
– Maximum
Pattern Recognition
– Bayes Classifier
– Minimum Distance Classifier
Image Analysis
– Boundary Detection
– Cooccurrence Matrix
– Dilation and Erosion
– Histogram Operations
– Polygon Circumscript
– Spatial Filtering
– Texture Operations
Genetic Algorithms
Neural Networks
Etc.
Data Mining systems usually involve a toolbox of many different techniques and a means for combining them
Typical Everyday Encounters with Data Mining
Google – Complex algorithm sequence to decide order
Amazon.Com – Additional purchase suggestions
Credit Card Fraud – Event notification of odd usage
Most current Data Mining applications are text based. Text provides an easily readable source of heterogeneous data. Mining of scientific data sets is more complex.
User Perspective and Data Perspective of the Data Mining Process
[Figure: two views of the data mining process, with axis labels Volume and Value.
User Perspective: Data Stores, Information, Analysis, Knowledge, Decision.
Data Perspective: Data, Dataset, Calibration & Navigation, Preprocessing, Transformation, Dataset Specific Algorithms, Domain Specific Algorithms.]
Similarity between Data Mining and Scientific Analysis Process

Scientific Analysis
Harnesses human analysis capabilities
– Highly creative
Based on theory and hypothesis formulation
– Physical basis is normally used for algorithms
Drawing insights about the underlying phenomena
Rapidly widening gap between data collection capabilities and the ability to analyze data
Potential for vast amounts of data to go unused

Data Mining
Provides automation of the analysis process
Can be used for dimensionality reduction when manual examination of data is impossible
Can have limitations
– May not utilize domain knowledge
– May be difficult to prove validity of the results
– There may not be a physical basis
Should be viewed as a complementary tool and not a replacement for scientific analysis
Mining Environments
Mining Framework (ADaM)
– Complete System (Client and Engine)
– Mining Engine (User provides its own client)
– Application Specific Mining Systems
– Operations Tool Kit
– Stand Alone Mining Algorithms
– Data Fusion
Distributed/Federated Mining
– Distributed services
– Distributed data
– Chaining using Interchange Technologies
On-board Mining (EVE)
– Real time and distributed mining
– Processing environment constraints
Using the Mining Framework: Focusing on the information in data

The ADaM Processing Model
[Figure: data stages labeled Raw Data, Translated Data, Preprocessed Data, Patterns/Models, Results]

Input
– PIP-2
– SSM/I Pathfinder
– SSM/I TDR
– SSM/I NESDIS Lvl 1B
– SSM/I MSFC Brightness Temp
– US Rain
– Landsat
– ASCII Grass
– Vectors (ASCII Text)
– HDF
– HDF-EOS
– GIF
– Intergraph Raster
– Others...

Processing: Preprocessing
– Selection and Sampling: Subsetting, Subsampling, Select by Value, Coincidence Search
– Grid Manipulation: Grid Creation, Bin Aggregate, Bin Select, Grid Aggregate, Grid Select, Find Holes
– Image Processing: Cropping, Inversion, Thresholding
– Others...

Processing: Analysis
– Clustering: K Means, Isodata, Maximum
– Pattern Recognition: Bayes Classifier, Min. Dist. Classifier
– Image Analysis: Boundary Detection, Cooccurrence Matrix, Dilation and Erosion, Histogram Operations, Polygon Circumscript, Spatial Filtering, Texture Operations
– Genetic Algorithms
– Neural Networks
– Others…

Output
– GIF Images
– HDF Raster Images
– HDF Scientific Data Sets
– HDF-EOS
– Polygons (ASCII, DXF)
– SSM/I MSFC Brightness Temp
– TIFF Images
– GeoTIFF
– Others...
Iterative Nature of the Data Mining Process
[Figure: iterative process diagram. Labels: Data; Preprocessing; Cleaning and Integration; Selection and Transformation; Mining; Discovery; Evaluation and Presentation; Knowledge]
Distributed/Federated Mining: Meshing data and algorithms to generate knowledge

ADaM: Mining Environment for Scientific Data
• The system provides knowledge discovery, feature detection and content-based searching for data values, as well as for metadata.
• Contains over 120 different operations
• Operations vary from specialized science data-set specific algorithms to various digital image processing techniques, processing modules for automatic pattern recognition, machine perception, neural networks, genetic algorithms and others
Classification Based on Texture Features and Edge Density
Cumulus cloud fields have a very characteristic texture signature in the GOES visible imagery
Science Rationale: Man-made changes to land use cause changes in weather patterns, especially cumulus clouds
Comparison based on
– Accuracy of detection
– Amount of time required to classify
Parallel Version of Cloud Extraction
[Figure: parallel processing diagram. Labels: GOES Image; Laplacian Filter; Sobel Horizontal Filter; Sobel Vertical Filter; Energy Computation (four blocks); Classifier; Cloud Image]
GOES images can be used to recognize cumulus cloud fields
Cumulus clouds are small and do not show up well in 4km resolution IR channels
Detection of cumulus cloud fields in GOES can be accomplished by using texture features or edge detectors
Three edge detection filters are used together to detect cumulus clouds. Because the filters run independently of one another, the approach lends itself to implementation on a parallel cluster
[Figure: GOES Image and Cumulus Cloud Mask]
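The three-filter scheme described above can be sketched as follows. This is only an illustration under stated assumptions, not the ADaM implementation: the kernels are the textbook Laplacian and Sobel operators, while the energy window, the threshold value, the summing rule used as the "classifier", and all function names are hypothetical. The thread pool shows the structure of the parallel branches; a real cluster version would distribute them across nodes.

```python
from concurrent.futures import ThreadPoolExecutor

# Textbook 3x3 edge-detection kernels.
LAPLACIAN = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]
SOBEL_H = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
SOBEL_V = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]


def apply_kernel(image, kernel):
    """Apply a 3x3 kernel to every interior pixel; border pixels stay at zero."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                kernel[i][j] * image[r + i - 1][c + j - 1]
                for i in range(3) for j in range(3)
            )
    return out


def local_energy(response, half=1):
    """Mean squared filter response in a window around each pixel (assumed measure)."""
    rows, cols = len(response), len(response[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                response[rr][cc] ** 2
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = sum(window) / len(window)
    return out


def filter_energy(image, kernel):
    """One branch of the pipeline: edge filter followed by energy computation."""
    return local_energy(apply_kernel(image, kernel))


def cumulus_mask(image, threshold=1000.0):
    """Run the three branches concurrently, then threshold the summed energy.

    The threshold and the summing rule are illustrative assumptions.
    """
    with ThreadPoolExecutor(max_workers=3) as pool:
        energies = list(pool.map(lambda k: filter_energy(image, k),
                                 (LAPLACIAN, SOBEL_H, SOBEL_V)))
    rows, cols = len(image), len(image[0])
    return [[sum(e[r][c] for e in energies) > threshold for c in range(cols)]
            for r in range(rows)]
```

A uniform image produces an empty mask, while a highly textured patch is flagged, which is the behaviour the texture-based rationale relies on.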
Automated Data Analysis for Boundary Detection and Quantification
Analysis of polar cap auroras in large volumes of spacecraft UV images
Science Rationale: Indicators to predict geomagnetic storms
– Damage satellites
– Disrupt radio connection
Developing different mining algorithms to detect and quantify polar cap boundary
[Figure: Polar Cap Boundary]
Detecting Signatures
Science Rationale: Mesocyclone signatures in radar data are indicators of tornadic activity
Developing an algorithm based on wind velocity shear signatures
– Improve accuracy and reduce false alarm rates
Genetic Subtyping Using Hierarchical Clustering
Biologists are interested in comparing DNA sequences to see how closely related they are to one another
Phylogenetic trees are constructed by performing hierarchical clustering on DNA sequences using genetic distance as a distance measure
Such trees show which organisms are most likely to share common ancestors, and may provide information about how various subtypes of organisms evolved
This information is useful when studying disease causing organisms such as viruses and bacteria, because genetically similar types should behave in similar ways
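As a rough illustration of the tree-building idea above, the sketch below clusters a few aligned sequences. It makes several assumptions that the slides do not specify: genetic distance is taken to be the fraction of differing sites, clusters are merged by average linkage, and the function names and the nested-tuple tree representation are invented for the example.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned sites at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def leaves(tree):
    """Return the sequence names under a tree node (a name or a nested pair)."""
    if isinstance(tree, str):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def build_tree(sequences):
    """Agglomerative clustering with average linkage; returns nested tuples."""
    def linkage(t1, t2):
        # Average pairwise distance between the members of two clusters.
        pairs = [(a, b) for a in leaves(t1) for b in leaves(t2)]
        return sum(p_distance(sequences[a], sequences[b]) for a, b in pairs) / len(pairs)

    clusters = list(sequences)
    while len(clusters) > 1:
        # Merge the two closest clusters into a new internal node.
        left, right = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append((left, right))
    return clusters[0]


if __name__ == "__main__":
    example = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA", "D": "TTGTACCT"}
    print(build_tree(example))
```

Sequences that differ at few sites end up as siblings in the nested tuples, mirroring how closely related subtypes sit together in a phylogenetic tree.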
Mining on Data Ingest: Tropical Cyclone Detection
Advanced Microwave Sounding Unit (AMSU-A) Data
Calibration/Limb Correction/Converted to Tb
Mining Environment
Data Archive
Result
Results are placed on the web, made available to National Hurricane Center & Joint Typhoon Warning Center, and stored for further analysis
Mining Plan:
• Water cover mask to eliminate land
• Laplacian filter to compute temperature gradients
• Science Algorithm to estimate wind speed
• Contiguous regions with wind speeds above a desired threshold identified
• Additional test to eliminate false positives
• Maximum wind speed and location produced
[Figure labels: Hurricane Floyd; Further Analysis; KnowledgeBase]
pm-esip.msfc.nasa.gov/
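The steps of the mining plan can be sketched roughly as below. Everything specific here is an assumption made for illustration: `estimate_wind` is a hypothetical linear stand-in for the science algorithm (which the slides do not describe), the false-positive test is reduced to a minimum region size, and the threshold values are arbitrary.

```python
def laplacian(grid, r, c):
    """Discrete 4-neighbour Laplacian at an interior cell."""
    return (grid[r - 1][c] + grid[r + 1][c] + grid[r][c - 1] + grid[r][c + 1]
            - 4 * grid[r][c])


def estimate_wind(gradient, scale=5.0):
    """HYPOTHETICAL stand-in for the science algorithm: wind grows with |gradient|."""
    return scale * abs(gradient)


def detect_cyclones(tb, is_water, wind_threshold=30.0, min_cells=2):
    """Return one {'max_wind', 'location'} record per detected region."""
    rows, cols = len(tb), len(tb[0])
    # Steps 1-3: mask land, compute temperature gradients, estimate wind speed.
    wind = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            if is_water[r][c]:
                wind[r][c] = estimate_wind(laplacian(tb, r, c))
    # Step 4: flood-fill contiguous regions whose wind exceeds the threshold.
    seen, detections = set(), []
    for r in range(rows):
        for c in range(cols):
            if wind[r][c] <= wind_threshold or (r, c) in seen:
                continue
            region, stack = [], [(r, c)]
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and wind[ny][nx] > wind_threshold):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            # Step 5: treat very small regions as false positives (assumed test).
            if len(region) < min_cells:
                continue
            # Step 6: report the maximum wind speed and where it occurs.
            peak = max(region, key=lambda p: wind[p[0]][p[1]])
            detections.append({"max_wind": wind[peak[0]][peak[1]], "location": peak})
    return detections
```

Here `tb` is a 2-D grid of brightness temperatures and `is_water` a matching grid of booleans; both layouts are assumptions of this sketch rather than the AMSU-A product format.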
Multiple Mining Environments: Passive Microwave ESIP Information System
[Figure: system diagram. Labels: TMI, AMSU-A, SSM/I, SSM/T2; Data Ingest & Processing; AMSU-A Ingest; AMSU Product Generation; TMI Ingest and Product Generation; Custom Processing; ADaM-based Processing; ADaM Servers; Input; Process: Subset/Grid/Format; Output; Order Staging; PM-ESIP Catalog; Distributed Data Stores; Web Interfaces & Applications; AMSU-A Images; Temperature Trends; STT Application; Visualization & Exploration; FTP; Cyclone Winds; Data Ordering]
Interoperability: Accessing Heterogeneous Data
• Science data comes in:
– Different formats, types and structures
– Different states of processing (raw, calibrated, derived, modeled or interpreted)
– Enormous volumes
• Heterogeneity leads to data usability problems
• One approach: Standard data formats
– Difficult to implement and enforce
– Can’t anticipate all needs
– Some data can’t be modeled or is lost in translation
• The cost of converting legacy data
• A better approach: Interchange Technologies
• Earth Science Markup Language
[Figure, The Problem: Data Formats 1, 2 and 3 reach the Application through Reader 1, Reader 2 and a Format Converter.]
[Figure, The Solution: Data Formats 1, 2 and 3 are each paired with an ESML file, and the Application reads the data through the ESML Library.]
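The interchange idea, in which an external description tells one generic library how to read each format, can be illustrated with a small sketch. The XML vocabulary below is invented for this example and is not real ESML syntax; the field names, type names, and the `build_reader` function are likewise hypothetical.

```python
import struct
import xml.etree.ElementTree as ET

# HYPOTHETICAL description schema, invented for this sketch (not real ESML syntax).
DESCRIPTION = """
<description byteorder="big">
  <field name="station" type="int16"/>
  <field name="temperature" type="float32"/>
</description>
"""

# Map the sketch's type names onto struct format characters.
TYPE_CODES = {"int16": "h", "int32": "i", "float32": "f", "float64": "d"}


def build_reader(description_xml):
    """Turn an external format description into a record-decoding function."""
    root = ET.fromstring(description_xml)
    fields = root.findall("field")
    prefix = ">" if root.get("byteorder") == "big" else "<"
    names = [f.get("name") for f in fields]
    layout = struct.Struct(prefix + "".join(TYPE_CODES[f.get("type")] for f in fields))

    def read(raw_bytes):
        # Decode every fixed-size record in the buffer into a dict.
        return [dict(zip(names, values)) for values in layout.iter_unpack(raw_bytes)]

    return read


if __name__ == "__main__":
    raw = struct.pack(">hf", 7, 21.5) + struct.pack(">hf", 8, -3.25)
    print(build_reader(DESCRIPTION)(raw))
```

The application code never changes when a new format arrives; only a new description file is written, which is the advantage over per-format readers and converters.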
Chained Image Processing Services
Service Chaining is used to integrate modules – or services – developed on distributed platforms and in different languages into a single processing solution.
[Figure: chained services connected by data streams. Labels: Data Files; ESML; ESML Lib Reader (Java/C+ Windows); GeoCrop (Perl/Linux); Resample (Perl/C – Linux); Format (Perl/Linux); Draw Image (PERL/C – Linux); WMS (Java/Windows); KnowledgeBase]
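A minimal sketch of the chaining pattern follows. In the system described above the services run on different platforms and in different languages; here plain local functions stand in for them, and the function names, signatures, and the grid-of-lists data model are assumptions made for illustration.

```python
from functools import reduce


def geo_crop(grid, top, left, height, width):
    """Keep only the requested rectangle of a 2-D grid."""
    return [row[left:left + width] for row in grid[top:top + height]]


def resample(grid, step):
    """Nearest-neighbour downsampling: keep every step-th row and column."""
    return [row[::step] for row in grid[::step]]


def to_ascii(grid):
    """Format the grid as space-separated text, one row per line."""
    return "\n".join(" ".join(str(v) for v in row) for row in grid)


def chain(*services):
    """Compose services so that each one consumes the previous one's output."""
    return lambda data: reduce(lambda result, service: service(result), services, data)


if __name__ == "__main__":
    grid = [[r * 10 + c for c in range(6)] for r in range(6)]
    pipeline = chain(
        lambda g: geo_crop(g, 1, 1, 4, 4),
        lambda g: resample(g, 2),
        to_ascii,
    )
    print(pipeline(grid))
```

Because each stage only depends on the data stream it receives, any stage could be swapped for a remote call without changing the others.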
Data Integration using Web Mapping Services
Fused Displays from Multiple Servers
[Figure: fused map display. Server labels: Globe, AMSU-A, KnowledgeBase, ITSC. Layer labels: Coastlines, Countries, MCS Events, Cyclone Events, AMSU-A Channel 01]
AMSU-A data overlaid with MCS and Cyclone events for September 2000, merged with world boundaries from Globe.
Analysis: Correlate MCSs and cyclones with atmospheric temperatures for September 2000.
Concept Hierarchy for Data Mining and Fusion
[Figure: multi-level mining hierarchy spanning a Data File Level and a Conceptual Level. Labels: Model and Observation Data; Feature I, Feature II, Feature III; Feature Set I; Feature X, Feature Y; Event A; Event B; Concept Mining; Multi-Level Mining; Decision Support]
On-Board Real-Time Processing
Sensor Control/Targeting
• Anomaly detection
• Data Mining
• Autonomous Decision Making
• Immediate response
• Direct satellite to Earth delivery of results
EVE – Environment for On-board Processing
www.itsc.uah.edu/eve
A Reconfigurable Web of Interacting Sensors
[Figure labels: Ground Network (three instances); Military; Weather; Satellite Constellations; Communications]
Example Plan: Threshold events in AMSU-A Streaming Data
[Figure label: EVE]
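Thresholding events in a data stream, as in the example plan above, might look like the sketch below. It is not EVE code: the function name, the tuple layout of an event, and the use of a plain Python iterable as the "stream" are assumptions for illustration. The generator keeps only a few values of state, which suits a constrained on-board environment.

```python
def threshold_events(stream, threshold):
    """Yield (start_index, end_index, peak_value) for each run of samples above threshold."""
    start = peak = None
    index = -1
    for index, value in enumerate(stream):
        if value > threshold:
            if start is None:
                # A new event begins at this sample.
                start, peak = index, value
            else:
                peak = max(peak, value)
        elif start is not None:
            # The run ended on the previous sample; report it immediately.
            yield (start, index - 1, peak)
            start = peak = None
    if start is not None:
        # The stream ended while an event was still in progress.
        yield (start, index, peak)


if __name__ == "__main__":
    for event in threshold_events([1, 5, 7, 2, 1, 9, 9], threshold=4):
        print(event)
```

Each event is emitted as soon as the signal drops back below the threshold, so a downstream step can react without waiting for the whole pass to finish.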
Data Integration and Mining: From Global Information to Local Knowledge
Precision Agriculture
Emergency Response
Weather Prediction
Urban Environments
Key Questions:
What is the most effective approach to developing an integrated framework and plan for an interdisciplinary environmental cyberinfrastructure?
What organizational structure is needed to provide long-term support for data storage, access, model development, and services for a global clientele of researchers, educators, policy makers, and citizens?
How will effective interagency and public-private partnerships be formed to provide financial support for such an extensive and costly system?
How can communication and coordination among computer scientists and environmental researchers and educators be enhanced to develop this innovative, powerful, and accessible infrastructure?