GECKO-A Emulator AI4ESS Hackathon: Challenge · 2020. 7. 15. · GECKO-A Library: 2000 GECKO-A...

AI4ESS Hackathon: GECKO-A Emulator

Challenge

GECKO-A Challenge: Build An Emulator For 3-D Models?

GECKO-A Training Library Machine-Learning Emulator 3-D Models

● Many inspiring applications out there: machine-learning emulators using explicit/process-level models, and implementing the trained emulators into large-scale models. Such explicit/process-level models are otherwise too expensive for large-scale models.

● The goal of this project is to train the machine-learning emulator using the “library” generated by the hyper-explicit chemical mechanism, GECKO-A.

Goal: Build Emulator to Predict the Total Organic Aerosol

GECKO-A Library:● 2000 GECKO-A simulations: in each run, we run GECKO-A under certain condition for 5

days● 2000 input files (csv).● Each file contains: (i) mass of precursors; (ii) mass of products in the gas-phase; and (iii)

mass of products in the particle-phase. All (i)-(iii) as a function of time.

Demo: what the data looks like

Base Model Box Emulator Model

INPUT(t)

OUTPUT(t+1)

STARTING CONDITIONS

BASE MODEL INPUT(t)

BASE MODEL PREDICTION(t+1)

Loop for length of experiment

Team 42: Gecko-A EmulationJeonghoe Kim, Josh Alland, Hemanth SK. Vepuri

● Methodsa. Linear models (LinearRegression, Ridge, Lasso, ElasticNet), tree-based models (RandomForest,

GradientBoosting), and neural network models (DNN, CuDNN-LSTM, LSTM) were tested.b. LinearRegression and CuDNN-LSTM show the best performance.

● DataTime series of features in selected experiments Correlation Heat map

Team 42: Gecko-A Emulation● Metric Scores (Box Emulator) and Time Series of Concentrations

● Feature Selection by Random Forest

Lin. Reg. Precursor Gas Aerosols

RMSE 0.00500 0.02494 0.01381

R2 0.82047 0.65767 0.74146

Hellenger 0.31740 0.31260 0.35933

Cu-LSTM Precursor Gas Aerosols

RMSE 0.00377 0.03379 0.02400

R2 0.91294 0.57065 0.69842

Hellenger 0.06289 0.49109 0.27553

Lin. Reg. (Left) CuDNN-LSTM (Right)

Lessons Learned with AI4ESS and Hackathon● Machine Learning is a very powerful tool, but

precise and sophisticated design of ML model is required. Using ML models without consideration often makes a catastrophically bad prediction.

● Pursuing a “best” accuracy of ML model does not guarantee a successful adoption of ML model to the prediction of certain phenomena.

Team 48: GECKO● Jiaze Wang*, Antonio Lorenzo*, Lee Brent*, Jared Brewer*● Linear Regression, PCA, Random Forest Tree Regressor, Gradient Boosting

Tree Regressor, Fully Connected Neural Network, Res Neural Net(failed to work)

Metrics for Box Emulator:RMSE: Precursor: 0.00375, Gas: 0.01818, Aerosols: 0.02049R2: Precursor: 0.88393, Gas: 0.55147, Aerosols: 0.72645Hellenger Distance: Precursor: 0.35322, Gas: 0.22406, Aerosols: 0.54450<matplotlib.axes._subplots.AxesSubplot at 0x7f8031acddd8>

Linear Regression modelTrue_box Pred_box

Team 48: GECKO● Metrics of ML and parameter selection doesn’t work well● RNN doesn’t work● Lesson learned: general ideas about ML and applications of ML in earth

sciences, and basic knowledge on how to do ML in python● Challenges: Having trouble in parameter selection, metrics and visualization

on evaluation ML models with so limited knowledge on ML packages in python

Team 23: GECKO (Iyasu Eibedingil, Ales Kuchar)● Summary of methods tried

○ Added gaussian noise helped to improved densely connected NN performance in terms of Box Emulator, however, still not able to capture autocorrelation of outputs variables

○ Autoregression model was tested (see black line ) on top of NN may solve the issue above => motivation for LSTM architecture

Team 23: GECKO (Iyasu Eibedingil, Ales Kuchar)

Interpretation of the ML model

● Score importance using RMSE shows that our output variables at t0 are highly important for t0+1

● Otherwise temperature and OH seems to be most important

Challenges

● Lack of time/workforce● Jupyterlab issues

Team 10: GECKO● Team Members: Devon Dunmire, Errami Larbi, Jean Lim, Luke Thompson● Methods tried: Linear Regression, Random Forest, Dense Neural Network.

LSTM

Team 10: GECKO

Epochs

Loss

Challenges: - Our best model did not outperform base model- Interpretation of LSTM model

Metrics for base model:RMSE: Precursor: 0.00023, Gas: 0.00019, Aerosols: 0.00022R2: Precursor: 0.99972, Gas: 0.99994, Aerosols: 0.99993Hellenger Distance: Precursor: 0.00003, Gas: 0.00002, Aerosols: 0.00568

Metrics for LSTM:RMSE: Precursor: 0.00035, Gas: 0.00051, Aerosols: 0.00079R2: Precursor: 0.99949, Gas: 0.99972, Aerosols: 0.99961Hellenger Distance: Precursor: 0.00024, Gas: 0.00013, Aerosols: 0.00236

Team 17: GECKO Transformed data (Standard, MinMax, Power)Bowen Fang, Jonathan Eliashiv, Shuting Zhai, Esther Lee, Fernando Campo*, and Raghavendra S. Mupparthy*

Visualization of training input data

Summary of methods tried: Linear, Random Forest Regressor, DNN, simple RNN, LSTM RNN

Team 17: GECKO

Lessons learned from Hackathon:● Fanciest tools are not always the best● Data preparation (pipeline scaling) is really important● Even with non-Gaussian transformation(MinMax transform), the result was good● Precision is as important as accuracy. You can’t improve one without the other (RMSE, MAE)Challenges: spin-up time

The model prediction shows good agreement with validation data All of the models we tried produce similar result

(useful plots / charts)

Team 4: GECKOA visualization of your results scores on the problem

Any other cool visualization of results or interpretation of the ML model

Lessons learned/challenges: the main problem was to change dimensionality to perform CNN or LSTM

We were unable to set the box emulator to predict the whole time series (something that need more time for understanding)

Members: Zhenyang, Dinara, Diana, and Jahangir*

Team 4: GECKO Members: Zhenyang, Dinara, Diana, and Jahangir*

ML methods we’ve tried during the hackathon:

● Standard and gaussian pdf scaler

● Linear regression and random forest

● PCA (inapplicable though)● DNN with different

hyperparameter settings● LSTM

Default hyperparameters: # of layers = 2; # of neurons = 100; AF = relu; learning rate = 0.0001

Metrics are shown in the table:Using LR=0.001 or 5 layers would increase the model score.

AF = Sigmoid

Test data (true)

learning rate = 0.001 # of neurons = 300

Metrics Default AF=Sigmoid LR=0.001 50 neurons 300 neurons 5 layers

RMSE 0.006570.034690.04207

0.075310.037560.06179

0.002340.025330.02199

0.019890.031490.01892

0.013140.019440.02083

0.002810.049460.07283

R2 0.619430.038040.12799

0.204080.353890.00600

0.956940.547680.30909

0.176100.277220.66895

0.392150.231050.32559

0.909520.196550.27384

H.D. 0.325910.266090.42771

0.659700.530430.67728

0.201060.363040.32899

0.216630.308130.31965

0.383670.244220.49995

0.224520.353840.42799

# of layers = 5 # of neurons = 50

Team 33: GECKO● Team Members: Ethan Kyzivat, Weiming Hu, Hauke Schulz, Chen-Kuang

(Kevin) Yang● Summary of methods tried

○ Random forest (RF)○ Densely Neural Network (DNN)○ Long Short-term Memory (LSTM): we decided to use LSTM because it is well-known for

time-series prediction

● Data preprocessing ○ Standardization: sklearn “StandardScaler()”○ Base data: 2,000 experiments (1,440 time-steps per experiment) from GECKO○ Training/Validation/Testing: 1,400/200/200 experiments○ Input training data (3-D): [samples, time-steps, features] = [1435*1400, 5, 9]○ In an essence: we want to use the 9 features from the 5 previous time-steps to inform the

information of the next time-step (prediction)

Training the LSTM: multivariate and one-step prediction

● Hyperparameters○ Architecture: 64 neurons, ReLU, dropout = 0.2 (prevent overfitting)○ Training: loss function = MSE, optimizer = Adam, epoch = 5, batch size = 1024, no

shuffle on the data● Evaluation (the graphs above)

○ LSTM + Box Emulator Model○ Showing the testing result of 5 experiments

Team 14: Gecko

● Performed:○ Exploration: PCA, linear regression, Random Forest,

gradient boosting

○ Neural networks: DCNN, SimpleRNN, LSTM

○ Tested sensitivity to various hyperparameters

● Significant:○ Found and fixed the time lag bug in prepare_data

○ Wrote new data preparation, NN, and box emulators to be compatible with time series analysis

○ Wrote functions for the complete workflow for easy model tuning, comparison and visualization

● Difficulties: Learning Python on the fly!

Members: Glenn Liu, Yiluan Song, Laurette Hamlin

Detecting relative importance of predictors using RF

Nonlinear relationship between variables

Team 14: Gecko Testing hyperparameters in DNN, RNN, and LSTMEffect of Learning Rate on Aerosols

Number of Neurons on Precursor

Effect of Activation on Gas

The precursor variable seemed to be less sensitive to choice of hyperparameters.

RNN was the most sensitive to hyperparameters; LSTM was the least sensitive to hyperparameters.

Team 14: Gecko Best model so far:• Predictors: all 9 input variables at t-1, t-2, t-3, t-4, t-5• ML method: LSTM• Architecture: 1 input layer (9 neurons) + 2 hidden layers (100

neurons each) + 1 output layer (3 neurons)• Activation: “relu”• Learning rate: 0.001• Number of Epochs: 5

Model Type Metric Variable

Baseline LSTM Precursor Gas Aerosols

RMSE 0.00003 0.00012 0.00007R^2 0.99999 0.99998 0.99999

Hellenger Distance 0 0.00002 0.00002

Precursor Gas AerosolsBox

Emulator RMSE 0.00049 0.00574 0.00873

R^2 0.99822 0.96352 0.89409Hellenger Distance 0.00032 0.0603 0.27265

We could do better given more time and more computational power!

A Conceptual Note on LSTMs

“I grew up in France where I embraced the language and became fluent in ______ “

Summary● Results on the base model do not always translate directly to the box

emulator.

● Data preparation for RNN/LSTM is not easy!

● LSTM with 5 look-back timesteps seems to be adequate solution to this problem! (a next step would be to see if this model would perform well varying environmental factors)

● Excellent work everyone!

GECKO-A Emulator AI4ESS Hackathon: Challenge · 2020. 7. 15. · GECKO-A Library: 2000 GECKO-A...

Documents

Transcript of GECKO-A Emulator AI4ESS Hackathon: Challenge · 2020. 7. 15. · GECKO-A Library: 2000 GECKO-A...