R package “ randomForests ” Erick Towett

Hands-on Soil Infrared Spectroscopy Training Course Getting the best out of light 11 – 15 November 2013 R package “randomForests” Erick Towett


Page 1: R package “ randomForests ”  Erick Towett

Hands-on Soil Infrared Spectroscopy Training Course

Getting the best out of light 11 – 15 November 2013

R package “randomForests” Erick Towett

Page 2: R package “ randomForests ”  Erick Towett

Welcome

Outline
• Introduction
• Usage
  • Total element composition of African soils using total X-ray fluorescence (TXRF).
  • Combining MIR and TXRF for the prediction of soil properties.
  • MIRS randomForests prediction models for soil properties.
• Demo application of RF to MIRS calibration.

Page 3: R package “ randomForests ”  Erick Towett

Introduction I

• “randomForest” (RF) implements Breiman’s random forest algorithm for classification and regression based on a forest of trees using random inputs.
• Version: 4.6-7
• Depends: R (>= 2.5.0)
• Description: Classification and regression based on a forest of trees using random inputs.
• URL: http://stat-www.berkeley.edu/users/breiman/RandomForests
• Reference: Breiman, L. (2001), Random Forests, Machine Learning 45(1), 5-32.

Page 4: R package “ randomForests ”  Erick Towett

Features of Random Forests

• RF is fast and easy to implement, and produces highly accurate predictions.
• It runs efficiently on large databases.
• It can handle thousands of input variables without variable deletion and without overfitting.
• It gives estimates of variable importance in the classification.
• RF handles complex data types well.
• It obviates the need to transform predictors to approximate normal distributions.

Page 5: R package “ randomForests ”  Erick Towett

Features of RF

What are the challenges of RF?
✗ There are many possible alternative nodes;
✗ reseeding will give different models.

How does RF work?
• The out-of-bag (OOB) error estimate
  • In RF, each tree is constructed using a different bootstrap sample from the original data.
  • About 1/3 of the cases are left out of the bootstrap sample and not used in the construction of the kth tree.
  • These left-out (out-of-bag) data are used to get a running unbiased estimate of classification error as trees are added to the forest.
  • They are also used to get estimates of variable importance.
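A rough illustration of the "about 1/3 left out" figure, written in plain Python rather than with the R package used in the course. The function name `oob_fraction` and the sample sizes are made up for the example; it only simulates the bootstrap step, not tree building.

```python
import random

def oob_fraction(n_cases, n_trees, seed=0):
    """Average share of cases left out of a bootstrap sample of size n_cases."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trees):
        # Bootstrap: draw n_cases indices with replacement.
        in_bag = {rng.randrange(n_cases) for _ in range(n_cases)}
        # Cases never drawn are "out-of-bag" for this tree.
        total += (n_cases - len(in_bag)) / n_cases
    return total / n_trees

frac = oob_fraction(n_cases=1000, n_trees=200)
# The expected share is (1 - 1/n)^n, which approaches exp(-1), roughly 0.37.
print(round(frac, 3))
```

Changing `seed` changes which cases fall out-of-bag for each tree, which is one reason a reseeded forest is not identical to the previous one.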

Page 6: R package “ randomForests ”  Erick Towett

How does RF work?

• RF can output a list of predictor variables that are important in predicting the outcome.
• The randomForest package in R has two measures of importance.
  • One is the "total decrease in node impurities from splitting on the variable, averaged over all trees."
  • The other is based on a permutation test.
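The permutation idea can be sketched in a few lines of Python. This is a simplified, model-agnostic version and not the package's own procedure: it permutes each predictor on the full data set and reports the mean increase in MSE, whereas a random forest would do this per tree on the out-of-bag cases. The names `permutation_importance` and `toy_predict` and the toy data are assumptions for the example.

```python
import random

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean increase in MSE when one predictor column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between predictor j and y
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Toy data (hypothetical): the response depends on the first predictor only.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 1), data_rng.uniform(0, 1)] for _ in range(200)]
y = [3.0 * row[0] for row in X]

def toy_predict(row):
    # Stand-in for a fitted model; it ignores the second predictor.
    return 3.0 * row[0]

imp = permutation_importance(toy_predict, X, y)
print(imp)  # large value for predictor 0, zero for predictor 1
```

A predictor the model never uses gets an importance of zero, because shuffling it leaves every prediction unchanged.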

Page 7: R package “ randomForests ”  Erick Towett

Usage

Study 1: Variability and patterns in total element composition of sub-Saharan Africa (SSA) soils using TXRF.

The objectives were to:
1. quantify the variability in total element composition of soils from a diverse set of soils across SSA using TXRF, and
2. explore the patterns in total element composition of the soils analysed.

Page 8: R package “ randomForests ”  Erick Towett


Materials and Methods

• Soils from 34 randomly-located 100-km2 sentinel sites across Africa.

Page 9: R package “ randomForests ”  Erick Towett

Land degradation surveillance framework (LDSF)

Consistent field protocol
Soil spectroscopy
Sentinel sites
Randomized sampling schemes

• LDSF = a hierarchical, spatially stratified random sampling scheme with ten 100 m2 plots nested within sixteen 1 km2 clusters, nested within 100 km2 sites.

Page 10: R package “ randomForests ”  Erick Towett

Materials and Methods

• Soil samples collected at two depths, 0-20 & 20-50 cm.
• Total of 1074 samples (16 samples per cluster x 2 soil depths x 34 sentinel sites) used for exploring spectral (TXRF) patterns.
• Total element conc. for 17 elements:
  • Al, P, K, Ca, Ti, V, Cr, Mn, Fe, Ni, Cu, Zn, Ga, Sr, Y, Ta, & Pb.

Page 11: R package “ randomForests ”  Erick Towett

Materials and Methods

• PCA on the TXRF data
• RF regression of factors vs the first 5 PCs of the TXRF element conc.
  • to confirm whether site or soil-forming factors (e.g., mineralogy, climate, topography & vegetation) are important drivers of total elemental conc. in the soil
  • to view the importance of the predictor variables.
• Site factors extracted for each site from LDSF database & Worldclim data & mineralogy data from XRD analysis
  • raw semi-quantitative mineralogy data & dominant mineralogy grouping.
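As a sketch of the PCA step only, the first principal component of log-transformed concentrations can be found by power iteration on the covariance matrix. This is plain Python for illustration, not the software used in the study; the function name `first_principal_component`, the log10 transform and the four toy samples are assumptions for the example.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (direction, scores) of the first PC via power iteration."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Start vector of ones; this would fail only if it were exactly
    # orthogonal to the leading eigenvector.
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores

# Hypothetical concentrations of two elements in four samples (mg kg-1).
conc = [[10.0, 20.0], [100.0, 200.0], [1000.0, 2000.0], [50.0, 100.0]]
logged = [[math.log10(x) for x in row] for row in conc]
direction, scores = first_principal_component(logged)
print(direction, scores)
```

In the workflow described on the slide, scores like these (for the first 5 PCs) would then serve as the response variables in the RF regression against site factors.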

Page 12: R package “ randomForests ”  Erick Towett

Results

• Total element conc. values were within the range reported globally for soil Cr, Mn, Zn, Ni, V, Sr, & Y and in the high range for Al, Cu, Ta, Pb, & Ga.

Mean and Range: values compiled from this study (mg kg-1).
Remaining columns: reported mean and ranges of background contents of elements in crust and worldwide soils (mg kg-1).

Element  Mean    Range          Worldwide ranges  Crustal average  Worldwide mean  Median values Ghana soil
Al       33927   94 - 89068     10000-40000       -                -               -
P        143     25 - 2358      -                 -                -               -
K        10893   291 - 77898    -                 -                -               -
Ca       9780    82 - 426431    -                 -                -               -
Ti       4264    2.6 - 25611    200-24000         4400             -               -
V        37      0.7 - 393      5.0-500           135              60              -
Cr       64      0.7 - 598      1-1500            100              42              72
Mn       466     1.6 - 6575     <7->9000          900              418             -
Fe       27954   20 - 181691    1000-550000       -                -               -
Ni       19      0.3 - 364      0.2-500           20               18              39
Cu       17      0.3 - 114      1.0-250           55               14              17-29
Zn       29      0.3 - 138      10-602            70               62              45-47
Ga       8       0.2 - 31       0.4-70            15               1.2             -
Sr       118     1.2 - 1985     32->1000          375              147             -
Y        13      0.2 - 109      16-33             33               12              -
Ta       3       0.1 - 16       0.8-5.3           2.0              1.1             -
Pb       37      0.3 - 638      2.0-16338         14               25              18-22

Page 13: R package “ randomForests ”  Erick Towett

Results

• Significant variations (P < 0.05) in total element composition within & between the sites for the 17 elements analysed.
• Greatest proportion of total variance & number of significant variance components occurred at the site (55-88%) followed by the cluster nested within site levels (10-40%).

Element  n     Site                 Site*Cluster         Site*Depth           Depth                Residual
               Estimate   %Tot var  Estimate   %Tot var  Estimate   %Tot var  Estimate   %Tot var  Estimate   %Tot var
Al       1068  0.966      88        0.112      10        0.004      0.4       0.005      0.45      0.016      1.4
P        1059  0.718      76        0.198      21        0.002      0.2       1.4*10^-21 <0.01     0.025      2.6
K        1065  0.913      71        0.354      28        0.003      0.2       6.8*10^-21 <0.01     0.010      0.8
Ca       1068  2.186      79        0.480      17        0.034      1.2       0.017      0.60      0.051      1.8
Ti       1067  1.398      87        0.199      12        0.001      0.1       0.001      0.04      0.014      0.9
V        1067  1.463      77        0.379      20        0.009      0.5       0.008      0.39      0.053      2.8
Cr       1068  0.808      65        0.384      31        0.005      0.4       0.006      0.46      0.039      3.2
Mn       1067  1.007      68        0.393      27        0.023      1.6       0.008      0.51      0.040      2.7
Fe       1066  1.459      80        0.335      18        0.005      0.3       0.009      0.47      0.026      1.4
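The "%Tot var" columns appear to be each variance component divided by the sum of all components. A small Python check using the Al estimates reported on this slide; the dictionary layout is just for illustration.

```python
# Variance component estimates for Al, as reported in the slide's table.
al = {
    "Site": 0.966,
    "Site*Cluster": 0.112,
    "Site*Depth": 0.004,
    "Depth": 0.005,
    "Residual": 0.016,
}

total = sum(al.values())
# Share of the total variance attributed to each level, in percent.
pct = {level: 100.0 * estimate / total for level, estimate in al.items()}

for level, share in pct.items():
    print(f"{level:<13}{share:6.2f}")
```

Rounded, the shares agree with the reported 88% for Site and 10% for Site*Cluster.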

Page 14: R package “ randomForests ”  Erick Towett

Results

• PCA revealed that patterns in total element conc. between sites appeared to relate to differences in mineralogical ‘functional groups’.
• The pattern of clustering of the individual minerals and sorting of heavy minerals (V, Pb, Ni, Cr, Cu, Ti, and Fe) along the positive Dim 1 axis is apparent.

Biplots (arrow sizes are proportional to the “initial” variability in the elements present) based on the principal components Dim 1 vs Dim 2 and Dim 1 vs Dim 3, on the log-transformed data of the soil total element concentration from all sites analysed.

Page 15: R package “ randomForests ”  Erick Towett

Results

• Strong observed within-site & between-site variations in many elements can serve to diagnose soil fertility potential.
• Elements clustered out differently in the different sample sets from different sentinel sites, indicating a wide variation in associations.
• Some elements are poorly represented (short arrows in the PCA).

Biplots based on PCA of element concentration for 4 sentinel sites.

Page 16: R package “ randomForests ”  Erick Towett

Results

• RF model performances were acceptable with R2>0.75.
• Most important variables = cluster, topography, landuse, precipitation and temperature.
• The importance of cluster is explained by spatial correlation at distances of < 1 km.

(a) Dim 1 (R2=0.92, rmsep=0.47) Dim 2 (R2=0.84, rmsep=0.40) Dim 3 (R2=0.79, rmsep=0.37)
(b) Dim 1 (R2=0.90, rmsep=0.52) Dim 2 (R2=0.80, rmsep=0.51) Dim 3 (R2=0.75, rmsep=0.41)

Variable importance plots showing the model accuracies & mean decrease in accuracy (%IncMSE) of the Random Forests regression of TXRF element concs against mineralogy + site/soil-forming factors (a) including cluster and (b) excluding cluster.
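For reference, a minimal Python sketch of how R2 and rmsep figures like those above are commonly computed from observed and predicted values. It assumes R2 = 1 - SS_res/SS_tot; the slides do not say which R2 definition was used, and the five observed/predicted pairs are invented.

```python
import math

def rmsep(observed, predicted):
    """Root mean squared error of prediction."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination, assuming 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical observed values and out-of-bag predictions.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

r2 = r_squared(observed, predicted)
err = rmsep(observed, predicted)
print(round(r2, 2), round(err, 3))
```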

Page 17: R package “ randomForests ”  Erick Towett

Usage

Study 2: Potential of combining MIR & TXRF spectroscopy for the prediction of soil properties

Objective: to evaluate whether TXRF can complement MIR for predicting soil test values, especially for tests that are poorly predicted by MIR (e.g. extractable P and K, and some micronutrients).

Page 18: R package “ randomForests ”  Erick Towett

Materials and Methods

• Georeferenced soil samples associated with the AfSIS Project.
• A total of 700 soil samples from 44 random 100-km2 sentinel sites, stratified according to Köppen-Geiger climatic zones and distributed across SSA.

Page 19: R package “ randomForests ”  Erick Towett

Materials and Methods

• Samples were analysed using a MIR spectrometer.

Fourier-Transform MIR spectrometer

• Infrared absorbance spectra were recorded at 4 cm-1 intervals in the range of 400 to 4000 cm-1.
• The average of the spectra for 4 replicates was taken.
• TXRF methodology was used for total elemental concentrations in each soil sample.

TXRF spectrometer

Page 20: R package “ randomForests ”  Erick Towett

Materials and Methods

• RF-OOB calibration models were developed (n = 700) to predict the reference properties from the TXRF total element composition, using the raw total element concentration data as ‘spectra’.
• Raw TXRF spectra were used in conjunction with 1st derivative MIR spectra to predict the reference soil properties.
• RF was used to calibrate the residuals of the predictions from the MIR spectral data to the raw TXRF total element data, as mixing different data types in the predictor variables might affect the variable importance weights in the fitted models.
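The residual-calibration idea (fit on MIR first, then model what is left over with TXRF) can be sketched as a two-stage fit. To stay short and self-contained, this Python sketch swaps in a one-variable least-squares line for the random forest at both stages, and it evaluates in-sample rather than out-of-bag; the data, `fit_line` and the variable names are all hypothetical.

```python
import math

def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns a prediction function.
    (Stand-in learner; the study used random forests at this step.)"""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    return lambda v: a + b * v

def rmse(obs, pred):
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

# Toy data: one MIR-derived predictor, one TXRF-derived predictor.
mir = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
txrf = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
y = [2.0 * m + 0.5 * t for m, t in zip(mir, txrf)]

# Stage 1: calibrate the soil property on the MIR predictor alone.
stage1 = fit_line(mir, y)
residuals = [yi - stage1(m) for yi, m in zip(y, mir)]

# Stage 2: calibrate the stage-1 residuals on the TXRF predictor.
stage2 = fit_line(txrf, residuals)

# Final prediction = MIR prediction + TXRF-based residual correction.
combined = [stage1(m) + stage2(t) for m, t in zip(mir, txrf)]

rmse_stage1 = rmse(y, [stage1(m) for m in mir])
rmse_combined = rmse(y, combined)
print(rmse_stage1, rmse_combined)
```

Keeping the two data types in separate stages means the importance weights of each model refer to one kind of predictor only.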

Page 21: R package “ randomForests ”  Erick Towett

Results

• MIR spectra resulted in very good prediction models using RF out-of-bag validation (R2 > 0.80) for organic C and N, total C and N, exchangeable Ca, Mehlich-3 Al and pH.
• Also predicted well (R2 > 0.60) were Ca/Mg ratio, exchangeable bases, exchangeable Mg, phosphorus sorption index (PSI), and water- and calgon-dispersed particles analysed by laser diffraction for sand content, clay content, and silt content.

Page 22: R package “ randomForests ”  Erick Towett

Results

• Calibration models were not satisfactory (R2<0.60) for Mehlich-3 extractable K, Mn, Fe, Cu, B, Zn, P, S, and Na, exchangeable acidity, electrical conductivity (ECd), exchangeable sodium percentage (ESP), exchangeable sodium ratio (ESR), and air-dispersed particles for silt, clay and sand contents.

Page 23: R package “ randomForests ”  Erick Towett

Results

• RF was able to improve prediction accuracies when the raw TXRF spectra were added to the MIR data, e.g. ECd (63% reduction in rmse), Mehlich-3 S (54%), exchangeable Na (53%), ESP (50%), ESR (50%), total C (29%), Mehlich-3 B (28%), Mehlich-3 Mn (26%), exchangeable Mg (17%), Mehlich-3 Cu (15%), Mehlich-3 Fe (11%), organic C (10%), Mehlich-3 Zn (6%), and silt content (8-50 microns) of air-dispersed particles by laser diffraction (4%).
• The improvement in the predictions was mostly due to TXRF detecting a few outlier samples that were different from the rest of the samples.
• TXRF data used as a predictor did not add value to MIR beyond identifying outlying samples. These could not be detected as MIR spectral outliers, hence TXRF may be used as an outlier detector.
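The percentage reductions in rmse quoted on this slide are presumably computed relative to the MIR-only model. A one-function Python sketch of that arithmetic, with invented rmse values (the slide reports only the percentages):

```python
def pct_reduction(rmse_before, rmse_after):
    """Percentage reduction in rmse relative to the starting model."""
    return 100.0 * (rmse_before - rmse_after) / rmse_before

# Hypothetical rmse values for a MIR-only and a MIR + TXRF model.
reduction = pct_reduction(rmse_before=0.80, rmse_after=0.40)
print(reduction)
```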

Page 24: R package “ randomForests ”  Erick Towett

Usage

Study 3:
• Analysis of MIRS randomForests prediction models for soil properties.
• This ongoing study attempts to offer an in-depth analysis of random forests models for the prediction of a number of soil properties using MIR spectroscopy.

Page 25: R package “ randomForests ”  Erick Towett

Materials and Methods

• 1907 soil samples were scanned with a MIR spectrometer at a resolution of 4 cm-1.
• The 1st derivative of the spectral range 601.7-4001.6 cm-1 was calculated with a smoothing interval of 21 data points using the soil.spec package in R.
• RF-OOB models were built to predict the reference properties from the MIRS 1st derivative spectra using the entire data set.
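To make the preprocessing step concrete, here is a Python sketch of a smoothed first derivative. It assumes a centered 21-point moving average followed by central differences; the slides do not state which smoothing algorithm soil.spec applies, so this is a stand-in, and the function names and the synthetic linear "spectrum" are invented for the example.

```python
def moving_average(values, window=21):
    """Centered moving average; the window shrinks symmetrically at the edges."""
    half = window // 2
    n = len(values)
    out = []
    for i in range(n):
        h = min(half, i, n - 1 - i)
        segment = values[i - h:i + h + 1]
        out.append(sum(segment) / len(segment))
    return out

def first_derivative(wavenumbers, absorbance, window=21):
    """Smooth the spectrum, then take central differences (drops both ends)."""
    smooth = moving_average(absorbance, window)
    return [
        (smooth[i + 1] - smooth[i - 1]) / (wavenumbers[i + 1] - wavenumbers[i - 1])
        for i in range(1, len(smooth) - 1)
    ]

# Synthetic spectrum: absorbance rising linearly with wavenumber.
wavenumbers = [600.0 + 2.0 * i for i in range(100)]
absorbance = [0.001 * w + 0.2 for w in wavenumbers]

deriv = first_derivative(wavenumbers, absorbance)
print(len(deriv), deriv[0])  # constant slope for a linear spectrum
```

The derivative spectra, one value per retained wavenumber, would then be the predictor columns passed to the RF model.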

Page 26: R package “ randomForests ”  Erick Towett


Preliminary Results

Page 27: R package “ randomForests ”  Erick Towett


Demo:

R package “randomForests”

Page 28: R package “ randomForests ”  Erick Towett


R package “randomForests”

Thank you for your attention