NLCD2001 – C5 and Cubist Training

NLCD2001 – C5 and Cubist Training Mike Coan ([email protected]) Limin Yang, Chengquan Huang, Bruce Wylie, Collin Homer Land Cover Strategies Team EROS Data Center, USGS June 2003


Page 1: NLCD2001 –  C5 and Cubist Training

NLCD2001 – C5 and Cubist Training

Mike Coan ([email protected])

Limin Yang, Chengquan Huang, Bruce Wylie, Collin Homer

Land Cover Strategies Team

EROS Data Center, USGS June 2003

Page 2: NLCD2001 –  C5 and Cubist Training

Overview

• Classification tree – C5/See5
– General description of the algorithm
– C5 how-to
– An example

• Regression tree – Cubist
– General description of the algorithm
– Cubist how-to
– An example

Page 3: NLCD2001 –  C5 and Cubist Training

C5/See5 – What is it?

• “…a system that extracts informative patterns from data.”

• C5:UNIX version / See5:Windows version

• Predicts categorical variables (e.g., land cover)

• www.rulequest.com - vendor and tutorial

Page 4: NLCD2001 –  C5 and Cubist Training

C5 for Land Cover Classification – Why this method?

Compared to other methods such as the maximum likelihood classifier, neural networks, etc., the classification tree method:

1) is non-parametric and therefore independent of the distribution of class signature,

2) can handle both continuous and nominal data,

3) generates interpretable classification rules, and

4) is fast to train and is often as accurate as or even slightly more accurate than many other classifiers.

Page 5: NLCD2001 –  C5 and Cubist Training

What does a decision tree model look like?

D-tree output syntax

    elev <= 1622:
    :...asp <= 2: 81 (62/1)
    :   asp > 2:
    :   :...asp <= 9: 41 (12/1)
    :       asp > 9: 81 (15)
    elev > 1622:
    :...slp > 10: 41 (34)
        slp <= 10:
        :...pidx > 64: 41 (15)
            pidx <= 64:
            :...slp <= 1: 81 (37/13)
                slp > 1:
                :...elev > 1885: 41 (42/3)
                    elev <= 1885:
                    :...asp <= 12:
                        :...slp <= 9: 41 (75/24)
                        :   slp > 9: 81 (2)

Comparable pseudo-code syntax

    If elev <= 1622
        if asp <= 2 then landcover = 81
        otherwise if asp <= 9 then landcover = 41
        otherwise landcover = 81
    Otherwise (elev > 1622)
        if slp > 10 then landcover = 41
        otherwise if pidx > 64 then landcover = 41
        otherwise if slp <= 1 then landcover = 81
        (and so on…)
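The same tree can be written as an ordinary function, which makes the nesting of the branches easier to follow. This is only an illustrative sketch in Python: the function name `classify` is hypothetical, the thresholds and class codes are copied from the tree shown on this page, and the branch for asp > 12 (which the slide does not show) is deliberately left undefined.

```python
from typing import Optional


def classify(elev: float, asp: int, slp: float, pidx: float) -> Optional[int]:
    """Return the land cover code predicted by the tree printed on this page.

    Returns None for the one branch (asp > 12) that the slide does not show.
    """
    if elev <= 1622:
        if asp <= 2:
            return 81
        return 41 if asp <= 9 else 81
    # From here on, elev > 1622.
    if slp > 10:
        return 41
    if pidx > 64:
        return 41
    if slp <= 1:
        return 81
    if elev > 1885:
        return 41
    # From here on, 1622 < elev <= 1885.
    if asp <= 12:
        return 41 if slp <= 9 else 81
    # The asp > 12 branch is cut off on the slide, so no value is invented here.
    return None
```

For example, `classify(1500, 5, 5, 50)` follows elev <= 1622, asp > 2, asp <= 9 and returns 41.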

Page 6: NLCD2001 –  C5 and Cubist Training

General description of the algorithm

[Figure: diagram with axes x and y]

Page 7: NLCD2001 –  C5 and Cubist Training

Major Steps in Developing a Spatial Classification (map) using C5

• Collect training points

• Develop a classification tree model (aka decision tree, or d-tree)

• Apply the model spatially to create a map

Page 8: NLCD2001 –  C5 and Cubist Training

Extract Coordinates from the Training Point File

• In an ERDAS IMAGINE image viewer, either create new training data, or load existing point data as an ESRI ARC coverage or shapefile

• From the viewer, select Vector>Attributes… to open the attribute table

• Without selecting any rows, highlight the two columns containing x and y coordinates

• With the cursor on the title row of the highlighted columns, right-click to open a list of options, and select Export

• Specify the output file (*.dat), making sure it goes to the desired directory

• Follow the same steps to output the land cover label column (if it exists) to a text file

Page 9: NLCD2001 –  C5 and Cubist Training

Viewer>Vector Attributes…>(select POINT_X and POINT_Y column headers)

Page 10: NLCD2001 –  C5 and Cubist Training

Right click on column headers, select Column Options > Export… > (Specify path of output)

Page 11: NLCD2001 –  C5 and Cubist Training

Extract the Spectral Values of the Training Points:

Utilities > Pixel to ASCII…

Page 12: NLCD2001 –  C5 and Cubist Training

Pixel To Table: Fill in the details

1) Specify each input, click “Add”, and check that it appears under “Files to export”

2) Type of Criteria, choose “Point File”

3) Specify x,y locations (*.dat) to extract

4) Direct Output File (*.asc) to desired directory

5) “OK”

Page 13: NLCD2001 –  C5 and Cubist Training

Create the *.data file

• Copy the Pixel to ASCII (*.asc) results – the header information will be important!

• Edit the copy, and delete the first few lines defining the input file and bands

• Load this file and the text file containing land cover labels in Excel, then copy/paste the land cover label as the last column in the file containing the spectral values

• Save the new file in comma-separated format (csv). Rename the *.csv file to *.data (the required extension for C5).
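The Excel copy/paste step can also be scripted. The sketch below is a minimal illustration in Python, not part of the ERDAS or C5 tooling: the function name `build_data_text` is hypothetical, and it assumes the header lines have already been removed and that the spectral rows and the land cover labels are in the same point order.

```python
import csv
import io
from typing import Iterable, Sequence


def build_data_text(spectral_rows: Iterable[Sequence[str]],
                    labels: Sequence[str]) -> str:
    """Append each land cover label as the last column of its spectral row.

    Returns the comma-separated text that would be saved as filestem.data.
    """
    rows = [list(row) for row in spectral_rows]
    if len(rows) != len(labels):
        raise ValueError("spectral rows and labels must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row, label in zip(rows, labels):
        writer.writerow(row + [label])
    return buffer.getvalue()
```

The returned text can be written directly to a file with the *.data extension, which avoids the separate rename from *.csv.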

Page 14: NLCD2001 –  C5 and Cubist Training

Creating a *.names file by hand

The first line defines the variable name to be classified, which also appears on the last line with the values to be assigned.

The order of input variables listed in this *.names file must correspond to the order of the data in the *.data file.

Every line ends with “.”, and comments can be inserted after the “.” with “|”.

Syntax is “variable: datatype.” Use brief but descriptive variable names.

Use “ignore.” to exclude certain input layers as desired, such as the coordinate x and y values.

Data can be discrete (integer values, which do not have ranking) or continuous (integer or floating point, with ranking). For example: the topographical derivative layer “aspect” with values 0-17 is a discrete layer. List these aspect values individually.

class. | to be predicted
x: ignore.
y: ignore.
gndvi1: continuous.
gndvi2: continuous.
gndvi3: continuous.
moist1: continuous.
moist2: continuous.
moist3: continuous.
loff1: continuous.
loff2: continuous.
loff3: continuous.
loff4: continuous.
loff5: continuous.
loff7: continuous.
lon1: continuous.
lon2: continuous.
lon3: continuous.
lon4: continuous.
lon5: continuous.
lon7: continuous.
spr1: continuous.
spr2: continuous.
spr3: continuous.
spr4: continuous.
spr5: continuous.
spr7: continuous.
aspect: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17.
class: 11,23,41,42,91,92.
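Because every line follows the same “variable: datatype.” pattern, a *.names file of this kind can be produced from a list of variables rather than typed by hand. The following is a small Python sketch under that assumption; `format_names` is a hypothetical helper, and it covers only the forms used on this page (ignore, continuous, and an explicit list of discrete values).

```python
from typing import Sequence, Tuple, Union

# A type is either a keyword such as "continuous" or "ignore",
# or a sequence of discrete integer values to be listed individually.
AttrType = Union[str, Sequence[int]]


def format_names(target: str,
                 attributes: Sequence[Tuple[str, AttrType]]) -> str:
    """Build the text of a *.names file.

    The first line names the target; the attributes follow in the same
    order as the columns of the *.data file, each line ending with ".".
    """
    lines = [f"{target}. | to be predicted"]
    for name, kind in attributes:
        if isinstance(kind, str):
            spec = kind
        else:
            spec = ",".join(str(value) for value in kind)
        lines.append(f"{name}: {spec}.")
    return "\n".join(lines) + "\n"
```

Passing `("aspect", range(18))` as one of the attributes lists the values 0 through 17 individually, as the discrete aspect layer requires.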

This page is included for UNIX use and historical interest – don’t expect to use a handbuilt *.names file with the CART Tool (it crashes…)

Page 15: NLCD2001 –  C5 and Cubist Training

ERDAS Module: Classification And Regression Tree (CART)

New ERDAS Imagine module, only for the Windows platform, to implement See5 and Cubist tools

CART Utilities > CART Sampling Tool…

Can create both the *.data and *.names files, but still will require some knowledgeable editing…

Page 16: NLCD2001 –  C5 and Cubist Training

CART Module, ERDAS 8.6, Windows XP

Page 17: NLCD2001 –  C5 and Cubist Training

CART Sampling Tool for See5

A raster image of training points is being used!

Background values in the training raster are set to 0 (not 255)

Select all your input variables, fill out the sampling list

Maximize the number of training points to use – but don’t try 100%! It crashes…

Page 18: NLCD2001 –  C5 and Cubist Training

CART Sampling Tool - *.names

| Generated with cubistinput by EarthSat
| Training samples : 504
| Validation samples: 0
| Minimum samples : 0
| Sample method : Random
| Output format : See5

dep. |d:/nlcd2000/training/c5/z16/train/trainingpts.img(:Band_1)

Xcoord: ignore.
Ycoord: ignore.
band01: continuous. |d:/nlcd2000/training/c5/z16/tc/tcoff.img(:Layer_1)
band02: continuous. |d:/nlcd2000/training/c5/z16/tc/tcoff.img(:Layer_2)
band03: continuous. |d:/nlcd2000/training/c5/z16/tc/tcoff.img(:Layer_3)
band04: continuous. |d:/nlcd2000/training/c5/z16/tc/tcon.img(:Layer_1)
band05: continuous. |d:/nlcd2000/training/c5/z16/tc/tcon.img(:Layer_2)
band06: continuous. |d:/nlcd2000/training/c5/z16/tc/tcon.img(:Layer_3)
band07: continuous. |d:/nlcd2000/training/c5/z16/tc/tcspr.img(:Layer_1)
band08: continuous. |d:/nlcd2000/training/c5/z16/tc/tcspr.img(:Layer_2)
band09: continuous. |d:/nlcd2000/training/c5/z16/tc/tcspr.img(:Layer_3)
band10: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17. |d:/nlcd2000/training/c5/z16/topo/aspect.img(:Layer_1)
band11: continuous. |d:/nlcd2000/training/c5/z16/topo/dem.img(:Layer_1)
band12: continuous. |d:/nlcd2000/training/c5/z16/topo/posindex.img(:Layer_1)
band13: continuous. |d:/nlcd2000/training/c5/z16/topo/slope.img(:Layer_1)
band14: continuous. |d:/nlcd2000/training/c5/z16/b9/b9off.img(:Layer_1)
band15: continuous. |d:/nlcd2000/training/c5/z16/b9/b9on.img(:Layer_1)
band16: continuous. |d:/nlcd2000/training/c5/z16/b9/b9spr.img(:Layer_1)

dep: 11,31,41,42,43,52,53,54,71,81,82,91,92. |d:/nlcd2000/training/c5/z16/train/trainingpts.img(:Band_1)

List of input variables (count ‘em: 18)

dep – the dependent variable, aka “the thing to be estimated”

Page 19: NLCD2001 –  C5 and Cubist Training

CART Sampling Tool - *.data

-1431270,1861950,87,80,84,110,68,63,103,74,71,1,1783,50,2,154,215,199,42

-1369020,1862160,120,60,52,145,57,32,108,65,63,1,1690,16,2,185,244,199,52

-1368780,1862070,70,75,95,96,72,73,77,74,94,13,1747,54,17,175,234,181,71

-1368990,1862010,81,69,72,105,70,55,82,78,80,13,1698,21,4,181,243,192,71

-1368660,1861950,89,67,81,104,67,74,84,72,87,13,1772,50,14,181,234,184,71

-1369470,1861710,76,77,102,112,66,73,84,73,91,5,1748,28,21,174,235,187,42

-1370070,1859610,68,83,101,86,80,84,71,81,97,6,1830,26,10,171,232,191,42

-1369950,1859580,62,80,94,84,78,77,64,81,95,1,1817,40,4,179,235,193,52

-1369920,1859400,58,82,97,84,76,75,66,81,93,15,1836,40,9,169,229,188,52

-1369230,1858290,66,77,88,89,71,71,74,78,85,1,2026,82,3,175,228,188,52

Xcoord, Ycoord, band01,…, band16, dep

Order of variables matches the *.names file

Page 20: NLCD2001 –  C5 and Cubist Training

Running See5 – Locate Data

Start->Programs->See5

See5->1st Icon: Locate Data->(point to *.data)

(See5 finds associated *.names file in same directory, too)

Page 21: NLCD2001 –  C5 and Cubist Training

Running See5 – Construct Classifier

See5->2nd Icon: Construct Classifier->(select options)

Typically, try a series of runs:

1) with cross-validation, to evaluate the training data and come up with a preliminary accuracy estimate,

2) with boosting, and

3) with neither.

Save each of the resulting *.out files – rename them to something unique, so they don’t get overwritten.

Page 22: NLCD2001 –  C5 and Cubist Training

Run C5 (UNIX server)

• Command syntax: c5 -f filestem

• Output model is saved in a file called filestem.tree

• The model can also be viewed as text on the screen, or redirected to a text file:

c5 -f filestem > filestem.out

Page 23: NLCD2001 –  C5 and Cubist Training

Model evaluation

More than one method to assess training accuracy:

1) Create an optional test file (filestem.test), containing training points withheld from developing the decision tree, to be used exclusively for evaluation

2) Run C5/See5 with the Cross-validation option – more realistic than training accuracy, and uses all training data sequentially (none withheld).

Page 24: NLCD2001 –  C5 and Cubist Training

Cross validation (-X option)

1. Divides the training samples into n equal-sized subsets

2. Develops a tree model using (n-1) subsets of training points and evaluates the model using the remaining subset

3. Repeats Step 2 n times, each time using a different subset for evaluation

4. Averages the results of these n tests

5. Command syntax: c5 -f filestem -X n
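The cross-validation steps above can be sketched in a few lines of Python. This is an illustration of the general procedure only, not C5's implementation: the function names are hypothetical, the folds are formed by simple interleaving rather than by whatever assignment C5 uses, and a trivial majority-class "model" stands in for the decision tree.

```python
from collections import Counter
from typing import Callable, List, Sequence


def fold_indices(n_samples: int, n_folds: int) -> List[List[int]]:
    """Step 1: split sample indices into n_folds subsets of near-equal size."""
    if not 1 < n_folds <= n_samples:
        raise ValueError("n_folds must be between 2 and the number of samples")
    return [list(range(start, n_samples, n_folds)) for start in range(n_folds)]


def cross_validate(features: Sequence, labels: Sequence, n_folds: int,
                   train: Callable) -> float:
    """Steps 2-4: train on n-1 subsets, test on the rest, average accuracy."""
    accuracies = []
    for held_out in fold_indices(len(labels), n_folds):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(model(features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / n_folds


def train_majority(features: Sequence, labels: Sequence) -> Callable:
    """Stand-in learner: always predicts the most common training label."""
    most_common = Counter(labels).most_common(1)[0][0]
    return lambda _row: most_common
```

Any function that takes training features and labels and returns a predictor can be passed as `train`, so the same loop applies to a real tree learner.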

Page 25: NLCD2001 –  C5 and Cubist Training

Pruning the d-tree model (-m or -c options)

• A d-tree model can be overfitted.

• Two options to control overfitting:

-m: specifies the minimum number of pixels in a node, below which the node can no longer be split. Larger m values increase the severity of pruning. Default value is 2.

-c: lowering this value causes more severe pruning of a tree model. Default value is 25, allowable range is 0-100.

Page 26: NLCD2001 –  C5 and Cubist Training

Boosting (-b option)

• Develops an initial tree model using all training points and classifies them

• With higher weights assigned to the misclassified points, resamples the same number of points from the original training data set, builds another tree model, and uses the new model to classify the original points

• Repeats the previous step several times (default is 10)

• All developed d-tree models are used to classify new sample points, with the final prediction a weighted vote of the predictions of those models

• Command syntax: c5 -f filestem -b

Why? Often improves accuracy by 5% or more!
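When a series of runs is scripted on the UNIX server, it can help to assemble the command line from named settings. The Python sketch below only builds the argument list; the helper name `build_c5_command` and its parameter names are hypothetical, and the flags are limited to the ones given on these pages (-f, -X, -b, -m, -c).

```python
from typing import List, Optional


def build_c5_command(filestem: str,
                     cross_validation_folds: Optional[int] = None,
                     boost: bool = False,
                     min_cases: Optional[int] = None,
                     confidence: Optional[int] = None) -> List[str]:
    """Return the c5 argument list for one run; nothing is executed here."""
    command = ["c5", "-f", filestem]
    if cross_validation_folds is not None:
        command += ["-X", str(cross_validation_folds)]
    if boost:
        command.append("-b")
    if min_cases is not None:
        command += ["-m", str(min_cases)]
    if confidence is not None:
        # The slides give 0-100 as the allowable range for -c.
        if not 0 <= confidence <= 100:
            raise ValueError("confidence must be between 0 and 100")
        command += ["-c", str(confidence)]
    return command
```

On a machine where c5 is installed, the list can be handed to `subprocess.run` with standard output redirected to a uniquely named *.out file, so that successive runs do not overwrite each other.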

Page 27: NLCD2001 –  C5 and Cubist Training

Develop a Spatial Classification (map)

A mask file is handy for speeding up the processing, and addressing all the pixels of an irregularly shaped mapping area.

An error, or confidence, layer can also help identify areas to inspect for evaluation – and may serve as a guide to where new or additional training data is needed.

Page 28: NLCD2001 –  C5 and Cubist Training

Review: Steps to Develop a Classification Tree Model

• Required files

• Optional file

• Cross-validation

• Pruning

• Boosting

Page 29: NLCD2001 –  C5 and Cubist Training

Review: Files for Running C5

• filestem.names – attribute table, required

• filestem.data – training data, required

• (filestem.test – withheld training data test file, optional)

Page 30: NLCD2001 –  C5 and Cubist Training

Review: Examples of Required *.names and *.data Files

taskname.names

lc. | to be classified
elev: continuous.
slp: continuous.
asp: discrete 17.
pidx: continuous.
lform: 0,1,2,3,4,5,6.
lc: 0,41,52,81.

taskname.data

1577,15,7,66,4,81
1499,19,5,50,4,81
1485,20,1,0,1,81
1507,10,10,50,4,81
1534,10,1,50,4,81
1548,1,1,50,4,81
1562,0,1,50,1,81
1542,13,17,33,1,81

The order of the variables in the *.data file must match the order of the variables in the *.names file!
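A quick consistency check between the two files can catch a mismatched column order or a stray value before a run. The Python sketch below is a hypothetical helper, not part of C5: it assumes the simple *.names layout shown in this review (a target line, then one “name: type.” line per column, with comments after “|”), and it only checks field counts and explicitly listed discrete values.

```python
from typing import List, Optional, Set, Tuple


def parse_names(text: str) -> List[Tuple[str, Optional[Set[str]]]]:
    """Return (attribute name, allowed values or None) in file order."""
    attributes = []
    for raw in text.splitlines():
        line = raw.split("|", 1)[0].strip()  # drop the comment part
        if not line or ":" not in line:
            continue  # blank line, or the first-line target declaration
        name, spec = line.split(":", 1)
        spec = spec.strip().rstrip(".").strip()
        first_word = spec.split()[0] if spec else ""
        if first_word in ("continuous", "ignore", "discrete"):
            allowed = None  # no explicit value list to check against
        else:
            allowed = {value.strip() for value in spec.split(",")}
        attributes.append((name.strip(), allowed))
    return attributes


def check_rows(names_text: str, data_text: str) -> List[str]:
    """List the problems found in the *.data rows; empty means none found."""
    attributes = parse_names(names_text)
    problems = []
    for number, raw in enumerate(data_text.splitlines(), start=1):
        if not raw.strip():
            continue
        fields = [field.strip() for field in raw.split(",")]
        if len(fields) != len(attributes):
            problems.append(f"row {number}: expected {len(attributes)} "
                            f"fields, found {len(fields)}")
            continue
        for (name, allowed), value in zip(attributes, fields):
            if allowed is not None and value not in allowed:
                problems.append(f"row {number}: {name}={value} "
                                "is not a listed value")
    return problems
```

Run against the taskname.names and taskname.data contents in this review, the check reports no problems, since each row has six fields and every lform and lc value appears in its list.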

Page 31: NLCD2001 –  C5 and Cubist Training

Finally! Run an Example

Sample Data: Subset of Southern Utah (zone 16)

Reflectance (bands 1,2,3,4,5,7 of leaf on, leaf off, spring dates): D:\Workspace\c5training\z16\refl

Tasselled Cap (wetness, greenness, brightness of same dates): D:\Workspace\c5training\z16\tc

Topographic Derivatives (aspect, DEM, position index, slope): D:\Workspace\c5training\z16\topo

Thermal (band 9 for each date): D:\Workspace\c5training\z16\b9

Date Mosaics (how each date mosaic was assembled): D:\Workspace\c5training\z16\date

Training data in ARC/Info export coverage format: D:\Workspace\c5training\z16\trainingpts.e00

Note: columns “point_x” and “point_y” are Albers x and y coordinates, and “nlcd2000” (last column) is the land cover label.

Page 32: NLCD2001 –  C5 and Cubist Training

Another Example – With Your Own Training Data

Sample Data: Subset of Northern Minnesota (zone 41)

Reflectance (bands 1,2,3,4,5,7 of leaf on, leaf off, spring dates): D:\Workspace\c5training\z41\refl

Tasselled Cap (wetness, greenness, brightness of same dates): D:\Workspace\c5training\z41\tc

Ratio Indices (Green NDVI, Moisture): D:\Workspace\c5training\z41\ratio

NOTE: These ratio indices are nonstandard! Here given only as examples of value-added, user specific input layers for determining woody-wetlands.

Thermal (band 9 for each date): D:\Workspace\c5training\z41\b9

Training data must be created by screen interpretation. Go to the Minnesota DNR website: www.ra.dnr.state.mn.us/forestview, and choose “ForestView +”. Specify our subset area, township T62R17W. With the online stand information, and a new point coverage in your ERDAS Imagine Viewer, create your own training data.

Page 33: NLCD2001 –  C5 and Cubist Training

Regression tree – Cubist

For estimating continuous variables like percent canopy cover and height, etc.

Page 34: NLCD2001 –  C5 and Cubist Training

Methods for estimating a continuous variable

• Physically-based models (e.g. Li and Strahler (1992))
– Too complex to be inverted

• Spectral mixture models:
– End-members – green vegetation, non-photosynthetic vegetation, soil etc., not directly interpretable as the target variables;
– Assumptions on spectral mixing may not be valid;

• Empirical models -- results directly interpretable as the target variables
– Linear regression cannot approximate non-linear relationships
– Regression tree can approximate complex nonlinear relationships
– Neural net

Page 35: NLCD2001 –  C5 and Cubist Training

[Figure: Regression tree cartoon. x-axis: independent variable (0 to 30); y-axis: dependent variable (0 to 100)]
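The idea behind the cartoon, approximating a nonlinear relationship by splitting the independent variable into ranges and fitting each range separately, can be shown with a single split. The Python sketch below is a deliberately simplified illustration with hypothetical function names: it finds one threshold and predicts the mean on each side, whereas Cubist builds multiple rules and fits a linear model within each.

```python
from typing import Sequence, Tuple


def sse(values: Sequence[float]) -> float:
    """Sum of squared errors of the values around their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((value - mean) ** 2 for value in values)


def best_split(x: Sequence[float],
               y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [pair[1] for pair in pairs[:i]]
        right = [pair[1] for pair in pairs[i:]]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (error, threshold,
                    sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2], best[3]


def predict(x_value: float, split: Tuple[float, float, float]) -> float:
    """Predict with the mean of whichever side of the threshold x falls on."""
    threshold, left_mean, right_mean = split
    return left_mean if x_value <= threshold else right_mean
```

Applying `best_split` recursively to each side would grow a deeper tree; the single split is enough to show how a step-like relationship is captured where one straight line would not fit.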

Page 36: NLCD2001 –  C5 and Cubist Training

Steps to developing a spatial map using Cubist

• Collect training points

• Develop a regression tree model

• Apply the model spatially to create a map

• Masking

Page 37: NLCD2001 –  C5 and Cubist Training

Develop a regression tree model

• Required files

• Optional files

• Cross-validation

• Pruning

• Committee model

• Composite model

Page 38: NLCD2001 –  C5 and Cubist Training

Files for running Cubist

• filestem.names – attribute table, required

• filestem.data – training data, required

• filestem.test – test file, optional

• filestem.cases – data to be classified, optional

Page 39: NLCD2001 –  C5 and Cubist Training

Example Files – Just Like C5

Except! The *.names target variable is continuous, not discrete…

filestem.names

treecover. | target
elev: continuous.
slp: continuous.
asp: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17.
pidx: continuous.
lform: 0,1,2,3,4,5,6.
treecover: continuous.

filestem.data

1577,1,1,66,4,15
1499,1,1,50,4,89
1485,1,1,0,1,20
1507,1,1,50,4,45
1534,0,1,50,4,60
1548,1,1,50,4,5
1562,0,1,50,1,100
1542,1,1,33,1,70

Page 40: NLCD2001 –  C5 and Cubist Training

Cubist Training Data

We use Cubist to estimate CONTINUOUS data (% impervious surface, % tree canopy). Training data is generated by classifying high resolution imagery (DOQQ’s, IKONOS) – use C5/See5 to classify!

Typically, impervious surfaces have a value of 1, shadows (unknown) have a value of 2, and all others go to 0.

CART Utilities->Percent Calculation… rescales the classified high-res image of 1m (DOQQ) or 4m (IKONOS) pixels into blocks of estimated percent coverage on 30m Landsat-compatible pixels
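The block aggregation can be pictured with a short Python sketch. This is not the CART Percent Calculation code, and the function name is hypothetical; it uses the 0/1/2 coding described above and assumes, as one plausible choice the slides do not confirm, that shadow pixels are left out of the denominator because their true cover is unknown.

```python
from typing import List, Optional, Sequence

IMPERVIOUS = 1  # coding taken from the slide
SHADOW = 2      # unknown cover


def percent_impervious(grid: Sequence[Sequence[int]],
                       block: int) -> List[List[Optional[int]]]:
    """Aggregate a classified high-res grid into block x block cells.

    Each output cell holds the rounded percent of known (non-shadow)
    pixels that are impervious, or None when the block is all shadow.
    Partial blocks at the right and bottom edges are skipped.
    """
    rows, cols = len(grid), len(grid[0])
    output = []
    for r0 in range(0, rows - block + 1, block):
        out_row = []
        for c0 in range(0, cols - block + 1, block):
            cells = [grid[r][c]
                     for r in range(r0, r0 + block)
                     for c in range(c0, c0 + block)]
            known = [value for value in cells if value != SHADOW]
            if not known:
                out_row.append(None)
            else:
                out_row.append(round(100 * known.count(IMPERVIOUS) / len(known)))
        output.append(out_row)
    return output
```

With 1m DOQQ pixels the block size would be 30, so each output cell summarizes up to 900 classified pixels.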

Page 41: NLCD2001 –  C5 and Cubist Training

Run CART Sampling Tool (Cubist XP)

Make the 30m dependent variable (continuous, values of 0 through 100) with the area of full extent, produced from a model (like union255.gmd), but with value 255 where it was padded.

Pick many thousands of Training points, and a non-zero number of points for Validation (it will crash if set to zero).

Stratified Sampling, with Minimum of 50 Samples per bin (0-100).

Page 42: NLCD2001 –  C5 and Cubist Training

Run Cubist (XP)

Start->Programs->Cubist

Cubist->1st Icon-> Locate Data->(point to *.data)

Page 43: NLCD2001 –  C5 and Cubist Training

Running Cubist – Build Model

Cubist->2nd Icon: Build Model->(select options)

NOTE: Cross Validation helps with preliminary assessment of the training data, but it DOES NOT generate a *.model file.

Run again, without Cross Validation, to make the required *.model

Page 44: NLCD2001 –  C5 and Cubist Training

CART – Cubist Run…

The estimated impervious percentage for the entire area!

A mask could be used to eliminate water bodies, or to restrict the classification to road buffered areas combined with prior urban classes.

Page 45: NLCD2001 –  C5 and Cubist Training

Run Cubist (UNIX Server)

Command syntax: cubist -f filestem

• Model is saved in a file called filestem.model

• The model can also be viewed as text on the screen, or redirected to a text file:

cubist -f filestem -e 0 > filestem.out

“-e 0” is to prevent extrapolation (beyond 100%)

Page 46: NLCD2001 –  C5 and Cubist Training

Regression Tree Model Evaluation – Just Like C5

Two methods of assessing training accuracy

• Can provide a test file of reserved training data – filestem.test

• Cross-validation – more realistic than training accuracy, uses all training data

Page 47: NLCD2001 –  C5 and Cubist Training

Pruning Regression Tree models

An r-tree model can be overfitted.

Two options to control overfitting:

-m: specifies the minimum number of pixels in a node, below which the node can no longer be split. Default value is 1% of training points.

-r: specifies the maximum number of output rules

Page 48: NLCD2001 –  C5 and Cubist Training

Committee model

Similar to the boosting function of C5.

Page 49: NLCD2001 –  C5 and Cubist Training

Composite model

• In addition to a regular regression tree model, Cubist can also make a prediction using the K-nearest neighbor (KNN) method

• Final prediction is a combination of both

• Initiated by option “-i”, or the program will determine whether a composite model is needed if option “-a” is used.
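To make the idea of a composite prediction concrete, here is a small Python sketch that combines a rule-based estimate with a K-nearest-neighbor estimate. The function names, the squared-distance measure, and the equal-weight average are all assumptions made for illustration; the slides do not say how Cubist combines the two, and its actual method may differ.

```python
from typing import Callable, Sequence, Tuple

TrainingCase = Tuple[Sequence[float], float]  # (input values, target value)


def knn_predict(query: Sequence[float],
                training: Sequence[TrainingCase],
                k: int) -> float:
    """Average the targets of the k training cases closest to the query."""
    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))

    nearest = sorted(training, key=lambda case: distance(case[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


def composite_predict(query: Sequence[float],
                      rule_model: Callable[[Sequence[float]], float],
                      training: Sequence[TrainingCase],
                      k: int = 3,
                      weight: float = 0.5) -> float:
    """Blend the rule model and KNN estimates (equal weights by assumption)."""
    return (weight * rule_model(query)
            + (1 - weight) * knn_predict(query, training, k))
```

Here `rule_model` stands for any function that returns the regression tree's estimate for a pixel, so the blend can be tried against a model produced separately.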

Page 50: NLCD2001 –  C5 and Cubist Training

Run a Cubist example

Percent Imperviousness of Columbus, GA:

Two training sites of 1 meter resolution color DOQQ, already classified by unsupervised classifications into tree, grass, shadow, bare, water, and impervious surface (6 classes).

Combine training sites, and recode so that impervious has value of 1, and shadow has value of 2, all others go to 0. Resample recoded file, from higher resolution to 30 meter estimated percent impervious. Increase area to match extent of L7 imagery, padding with values of 255: see “union255.gmd”.

Use this file, along with two dates of L7 imagery (leaf-on and leaf-off), to estimate percent imperviousness for the entire area of Landsat data.