Chandrika Kamath and Imola K. Fodor
Center for Applied Scientific Computing, Lawrence Livermore National Laboratory
Gatlinburg, TN, March 26-27, 2002
Dimension Reduction and Sampling
First SDM ISIC All-Hands Meeting
UCRL. This work was performed under the auspices of the U.S. Department of Energy by University of California Lawrence Livermore National Laboratory under contract W-7405-Eng-48.
Dimension Reduction and Sampling at LLNL-2CASC
The SDM ISIC aims to minimize the effort researchers spend in managing their data
LLNL is participating in several of the tasks, including
  - data mining to improve the management of data
Problem: data from simulations and experiments are high-dimensional (i.e., they have many features)
Querying the features can help in understanding the data
  - but searching in a high-dimensional space is difficult
We may want to cluster similar objects for efficient access
  - but clustering is expensive in high dimensions
We plan to address the problem of high dimensionality using techniques for dimension reduction and sampling originally developed in data mining.
Our work on dimension reduction will help both data management and mining
Reducing the dimensions will improve
  - searching (task 3.1, LBNL)
  - clustering (task 2.1, ORNL)
Dimension reduction is expensive when there are many data items
  - use a sample of the data items
  - techniques for sampling in the presence of rare events
We will focus on climate and high-energy-physics (HEP) data
  - complements work at ORNL (climate) and LBNL (HEP)
  - but the techniques are applicable to other data as well
We only report the 0.8 FTE of work funded under SciDAC; however, our data mining research is more extensive. See www.llnl.gov/casc/sapphire
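The idea of sampling data items without losing rare events can be illustrated with a minimal sketch. It assumes the rare items have already been flagged by some boolean mask; the function name `sample_with_rare_events`, the mask, and the sizes are hypothetical and are not taken from the slides.

```python
import numpy as np

def sample_with_rare_events(is_rare, n_common, rng):
    """Return sorted indices: every rare item plus n_common random common items."""
    is_rare = np.asarray(is_rare, dtype=bool)
    rare_idx = np.flatnonzero(is_rare)
    common_idx = np.flatnonzero(~is_rare)
    # Never ask for more common items than exist.
    n_common = min(n_common, common_idx.size)
    chosen = rng.choice(common_idx, size=n_common, replace=False)
    return np.sort(np.concatenate([rare_idx, chosen]))

rng = np.random.default_rng(0)
n_items = 10_000
# Hypothetical flag: roughly 0.5% of the items are marked as rare events.
is_rare = rng.random(n_items) < 0.005
idx = sample_with_rare_events(is_rare, n_common=500, rng=rng)
```

A plain uniform sample of the same size would be expected to keep only a small fraction of the rare items, which is why the rare stratum is kept whole here. Any later statistics computed on the sample would need to account for the unequal sampling rates.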
There are two different ways in which we can view dimension reduction
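Reduce the number of features representing a data item: a table of n data items by p features (f_1, f_2, ..., f_p) becomes a table of n data items by p' features (f'_1, f'_2, ..., f'_{p'}), with p' < p
Reduce the number of basis vectors used to describe the data: if some of the coefficients a_{ij} are small, the corresponding terms can be ignored
$$\mathrm{DataItem}_i = \sum_{j=1}^{N} a_{ij}\,\mathrm{BasisVector}_j$$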
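The two views of dimension reduction can be contrasted in a short sketch. It assumes synthetic data, a variance-based rule for picking features, and an SVD-derived orthonormal basis; none of these choices comes from the slides, and all sizes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 10, 3  # hypothetical: n data items, p features, k kept
# Synthetic data lying close to a k-dimensional subspace, plus small noise.
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))
X = X + 0.01 * rng.normal(size=(n, p))

# View 1: keep a subset of the original features (here, the k columns
# with the largest variance; an illustrative selection rule only).
keep = np.sort(np.argsort(X.var(axis=0))[-k:])
X_features = X[:, keep]

# View 2: write each centered data item as a sum of coefficients times
# basis vectors, then keep only the k terms with the largest coefficients.
mean = X.mean(axis=0)
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
coeffs = U[:, :k] * s[:k]   # coefficients of each item on the kept basis vectors
basis = Vt[:k]              # the kept basis vectors, one per row
X_approx = mean + coeffs @ basis

# Relative error introduced by ignoring the remaining p - k terms.
rel_err = np.linalg.norm(X - X_approx) / np.linalg.norm(X - mean)
```

In the first view the retained columns are still original features; in the second, each retained coordinate is a combination of all the original features.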
Our work on climate data focuses on reducing the number of basis vectors
Domain expert: Dr. Benjamin Santer (LLNL climate)
Climate scientists are interested in understanding the change in the earth's surface temperature
Simulated and observed data are mixtures of volcano, El Niño, and other effects
Our goal is to separate the signals corresponding to different effects
  - traditional approaches such as principal component analysis (PCA) have not worked
  - separation is difficult because the El Chichón and Pinatubo volcano eruptions coincided with El Niño events
  - our approach is to use independent component analysis (ICA)
Dimension reduction supporting scientific discovery
The raw data are monthly temperatures on a 144x73 spatial grid at 17 vertical levels
[Figure: the raw data are passed through ICA to separate the volcano, El Niño, and other effects]
Figure caption: January 1979 raw temperatures (Kelvin) on the 144x73 (longitude by latitude) grid at the 1000 hPa pressure level. Data from NCEP.
Initially, we applied ICA to global monthly mean anomaly temperatures
Time series of global monthly mean anomalies, Jan 1979 - Dec 2000
17 vertical levels
  - level 1: 1000 hPa, lowest altitude
  - level 17: 10 hPa, highest altitude
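A minimal sketch of how global monthly mean anomalies could be derived from gridded temperatures follows. It makes several assumptions that the slides do not state: synthetic data on a coarse grid instead of the 144x73 NCEP grid, cosine-of-latitude area weights for the global mean, and anomalies defined relative to the mean of each calendar month over the whole period. All variable names are hypothetical.

```python
import numpy as np

n_years, n_levels = 22, 17          # Jan 1979 - Dec 2000, 17 pressure levels
n_lat, n_lon = 19, 36               # coarse stand-in for the 144x73 grid
n_months = 12 * n_years
rng = np.random.default_rng(0)

lat = np.linspace(-90.0, 90.0, n_lat)
month = np.arange(n_months)
# Synthetic temperatures (Kelvin): a seasonal cycle plus random noise.
seasonal = 5.0 * np.sin(2.0 * np.pi * month / 12.0)
temps = 250.0 + seasonal[:, None, None, None] + rng.normal(
    scale=0.5, size=(n_months, n_levels, n_lat, n_lon))

# Area-weighted global mean: average over longitude, then weight each
# latitude band by cos(latitude) (an assumed weighting).
weights = np.cos(np.deg2rad(lat))
weights = weights / weights.sum()
global_mean = temps.mean(axis=3) @ weights        # shape (n_months, n_levels)

# Anomaly: subtract the long-term mean of each calendar month.
by_year = global_mean.reshape(n_years, 12, n_levels)
climatology = by_year.mean(axis=0)                # shape (12, n_levels)
anomalies = (by_year - climatology).reshape(n_months, n_levels)
```

The result is one time series of 264 monthly values per vertical level, which is the form of input that a signal-separation method such as ICA can operate on.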
Next, we ran experiments with simulated data to understand the behavior of ICA
[Figure panels: (i) two original sources; (ii) two mixed signals obtained by mixing the original sources; (iii) sources (ICs) recovered from (ii) by ICA]
ICA correctly estimates the shapes of the two independent components (ICs).
With additional processing, we can also estimate the relative contributions of the two ICs to the two mixed signals.
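The simulated experiment can be sketched as follows, under stated assumptions. The slides do not say which ICA algorithm or software was used; this sketch uses a small symmetric FastICA with a tanh nonlinearity written in NumPy. The source shapes, the mixing matrix `A`, and the least-squares step used to estimate contributions are all hypothetical choices and are not the authors' "additional processing".

```python
import numpy as np

def fastica(X, n_iter=200, tol=1e-8, seed=0):
    """Symmetric FastICA (tanh nonlinearity). X has shape (n_signals, n_samples).

    Returns the estimated ICs (unit variance) and the centered input.
    """
    Xc = X - X.mean(axis=1, keepdims=True)
    T = Xc.shape[1]
    # Whiten: rotate and scale so the covariance becomes the identity.
    d, E = np.linalg.eigh(Xc @ Xc.T / T)
    Z = (E / np.sqrt(d)).T @ Xc
    rng = np.random.default_rng(seed)
    W = np.linalg.qr(rng.normal(size=(X.shape[0], X.shape[0])))[0]
    for _ in range(n_iter):
        G = np.tanh(W @ Z)
        W_new = G @ Z.T / T - np.diag((1.0 - G**2).mean(axis=1)) @ W
        # Symmetric decorrelation: project back onto orthogonal matrices.
        U, _, Vt = np.linalg.svd(W_new)
        W_new = U @ Vt
        converged = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0)) < tol
        W = W_new
        if converged:
            break
    return W @ Z, Xc

# Two hypothetical sources: a sine wave and a volcano-like signal
# (sudden cooling followed by an exponential recovery).
t = np.arange(2000)
sine = np.sin(2.0 * np.pi * t / 20.0)
volcano = np.zeros(t.size)
for t0 in (300, 1200):
    volcano[t0:] -= np.exp(-(t[t0:] - t0) / 40.0)
S = np.vstack([sine, volcano])

A = np.array([[1.0, 0.5], [0.6, 1.0]])   # assumed mixing matrix
X = A @ S                                 # the two mixed signals

ics, Xc = fastica(X)
# Because the ICs are uncorrelated with unit variance, a least-squares fit
# of the mixed signals onto the ICs reduces to a simple cross-product.
A_est = Xc @ ics.T / Xc.shape[1]
# contrib[i, j] is the part of mixed signal i attributed to IC j.
contrib = A_est[:, :, None] * ics[None, :, :]
```

ICA leaves the sign, scale, and order of the components undetermined, so the recovered ICs match the sources in shape only; the fitted matrix `A_est` carries the amplitude information for each mixed signal.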
Original decomposition of the two mixed signals (-): sine (--) and volcano (-.)
(i) Signal 1
(ii) Signal 2
ICA decomposition of the two mixed signals (-): sine (--) and volcano (-.)
(i) Signal 1
(ii) Signal 2
ICA can also separate “noise” used as an extra component in the mixing
[Figure: 3 original sources are mixed into 3 mixed signals, from which ICA produces 3 estimated ICs]
Original decomposition of 3 mixed signals (-): El Niño (--), volcano (-.), and noise (..)
The cooling in the global series at the arrow is in fact a combination of an ENSO warming and a volcano cooling. Without the volcano eruption, the El Niño warming would dominate, resulting in warmer global temperatures.
(i) Signal 1
(ii) Signal 2
(iii) Signal 3
ICA decomposition of 3 mixed signals (-): El Niño (--), volcano (-.), and noise (..)
Although the exact amplitudes are not recovered perfectly, ICA clearly separates the cooling effect of the volcano from the warming effect of El Niño.
(i) Signal 1
(ii) Signal 2
(iii) Signal 3
Our future plans include work with HEP data and collaborators at ORNL and LBNL
Complete the work on the climate problem
  - our results with artificial data are encouraging
  - identify an appropriate ICA model for climate data
Make the ICA software accessible to SciDAC scientists
Try ICA and other dimension reduction techniques in the context of the STAR high-energy-physics data
  - reduce the number of features
  - investigate sampling to reduce computation
  - collaborate with LBNL (data, searching)
Investigate incremental PCA
  - monitor climate simulations using indices based on the principal components
  - collaborate with ORNL (data, clustering)
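One simple way to picture incremental PCA for monitoring is sketched below. The slides do not specify an algorithm, so this is an assumption: the mean and scatter matrix are updated one sample at a time, and the principal components are recomputed on demand. The class `RunningPCA` and its method names are hypothetical.

```python
import numpy as np

class RunningPCA:
    """Accumulate mean and scatter one sample at a time; PCA on demand."""

    def __init__(self, n_features):
        self.n = 0
        self.mean = np.zeros(n_features)
        self.scatter = np.zeros((n_features, n_features))

    def update(self, x):
        # Welford-style update of the running mean and scatter matrix.
        x = np.asarray(x, dtype=float)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.scatter += np.outer(delta, x - self.mean)

    def components(self, k):
        # Leading k eigenvalues and eigenvectors of the sample covariance.
        cov = self.scatter / (self.n - 1)
        vals, vecs = np.linalg.eigh(cov)
        order = np.argsort(vals)[::-1][:k]
        return vals[order], vecs[:, order].T

    def indices(self, x, k):
        # Monitoring indices: projection of a sample onto the leading k PCs.
        _, pcs = self.components(k)
        return pcs @ (np.asarray(x, dtype=float) - self.mean)

rng = np.random.default_rng(0)
# Hypothetical stream of 500 samples with 5 features of decreasing spread.
data = rng.normal(size=(500, 5)) * np.array([3.0, 1.0, 0.5, 0.2, 0.1])
pca = RunningPCA(n_features=5)
for row in data:
    pca.update(row)
vals, pcs = pca.components(2)
latest_indices = pca.indices(data[-1], 2)
```

This variant stores a full covariance matrix, which is only practical for a moderate number of features; for very high-dimensional simulation output, a method that tracks only the leading components would be needed.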