RandomForests for Biomedical Applications


Page 1: RandomForests for Biomedical Applications

Random Forests and Archetypal Analysis of Dietary Patterns in the Cache County Memory Study

Adele Cutler
Department of Mathematics and Statistics
Utah State University

This research is partially supported by NIH 1R15AG037392-01

Page 2: RandomForests for Biomedical Applications

04/13/2023 ADMC 2012 2

Leo Breiman, 1928 - 2005

1984 CART

1994 Archetypal Analysis

1996 Bagging

2001 Random Forests

Page 3: RandomForests for Biomedical Applications

Example 1: Cookbooks and nutrition

• 300 recipes from 12 cookbooks
• Nutritional information (33 predictors)

Joint work with Sheryl Aguilar and Michael Lefevre

Center for Advanced Nutrition, Utah State University

Page 4: RandomForests for Biomedical Applications

Example 2: The Cache County Memory Study

Page 5: RandomForests for Biomedical Applications

Utah

Page 6: RandomForests for Biomedical Applications

Cache Valley, Utah

Page 7: RandomForests for Biomedical Applications

Utah State University

Page 8: RandomForests for Biomedical Applications

Example 2: The Cache County Memory Study

• Prospective, population-based study, 1995-2006
• 5,092 people aged 65 and over
• Food frequency questionnaire

Joint work with Heidi Wengreen (2), Chris Corcoran (1), and Anna Quach (1)

(1) Mathematics and Statistics, Utah State University
(2) Nutrition and Food Sciences, Utah State University

Page 9: RandomForests for Biomedical Applications

Outline

• RF for cookbooks
• RF for memory study
• Archetypes for cookbooks
• Archetypes for memory
• Current development

Page 10: RandomForests for Biomedical Applications

Random Forests

Page 11: RandomForests for Biomedical Applications

Random Forests for Classification

Example 1 (cookbooks):
• Predict the author of a recipe based on the nutritional content of the recipe
• Which variables are important?

Example 2 (memory):
• Predict a person’s dementia status (yes/no) based on their diet
• Which variables are important?

Page 12: RandomForests for Biomedical Applications

Example 1: Cookbooks

?

Page 13: RandomForests for Biomedical Applications

Cookbooks: Predict the author?

Cookbook       Error Rate (%)
AHA            4
Cookbook 2     40
Cookbook 3     59
Cookbook 4     95
Cookbook 5     79
Cookbook 6     65
Cookbook 7     91
Cookbook 8     64
Cookbook 9     15
Cookbook 10    92
Cookbook 11    72
Cookbook 12    85

Error rate 63%
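As a rough consistency check on the table, the sketch below assumes that each of the 12 cookbooks contributes 25 of the 300 recipes. That is an assumption; the slides do not state the per-cookbook counts. Under it, the overall error rate is the plain average of the per-cookbook rates. The error rate of guessing uniformly at random among 12 equally sized classes is computed as a reference point.

```python
# Per-cookbook error rates (%) copied from the table above.
error_rates = {
    "AHA": 4, "Cookbook 2": 40, "Cookbook 3": 59, "Cookbook 4": 95,
    "Cookbook 5": 79, "Cookbook 6": 65, "Cookbook 7": 91, "Cookbook 8": 64,
    "Cookbook 9": 15, "Cookbook 10": 92, "Cookbook 11": 72, "Cookbook 12": 85,
}

# Assumption: equal class sizes (300 recipes / 12 cookbooks = 25 each),
# so the overall rate is the unweighted mean of the per-class rates.
overall = sum(error_rates.values()) / len(error_rates)

# Reference point: uniform random guessing among 12 balanced classes.
chance = 100 * (1 - 1 / len(error_rates))

print(f"overall error rate: {overall:.1f}%")
print(f"chance error rate:  {chance:.1f}%")
```

Under the equal-size assumption the average comes to about 63%, in line with the reported figure, and well below the roughly 92% expected from random guessing.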

Page 14: RandomForests for Biomedical Applications

Cookbooks: important variables

Error rate 63%

• Fat (g)
• Saturated fat (g)
• Cholesterol (mg)
• Monounsaturated fat (g)
• Sodium (mg)
• Protein (g)
• Vitamin B6 (mg)

Page 15: RandomForests for Biomedical Applications

Two Classes: AHA versus the rest

Error rate 2.33%

• Fat (g)
• Monounsaturated fat (g)
• Saturated fat (g)
• Sodium (mg)
• Polyunsaturated fat (g)
• Protein (g)
• Cholesterol (mg)

Page 16: RandomForests for Biomedical Applications

Two Classes: AHA versus the rest

Error rate 2.33%

Actual    Predicted Other   Predicted AHA   Error Rate (%)
Other     274               1               0.36
AHA       6                 19              24.00

Class weights!
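The per-class and overall error rates on this slide follow directly from the counts. A minimal sketch, assuming the rows are actual classes and the columns are predicted classes (the function name `error_rates` is illustrative only):

```python
def error_rates(confusion):
    """confusion[actual][predicted] -> count.

    Returns (per-class error rates in %, overall error rate in %).
    """
    per_class = {}
    wrong = total = 0
    for actual, row in confusion.items():
        n = sum(row.values())          # cases that truly belong to this class
        missed = n - row[actual]       # cases predicted as some other class
        per_class[actual] = 100 * missed / n
        wrong += missed
        total += n
    return per_class, 100 * wrong / total


# Counts from the AHA-versus-rest table.
confusion = {
    "Other": {"Other": 274, "AHA": 1},
    "AHA": {"Other": 6, "AHA": 19},
}
per_class, overall = error_rates(confusion)
print(per_class, overall)
```

The overall rate is dominated by the 275 "Other" recipes, which is why a 24% error rate on the 25 AHA recipes barely moves it.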

Page 17: RandomForests for Biomedical Applications

Class Weights

80% weight AHA, 20% weight “Other”
Error rate 5%

Actual    Predicted Other   Predicted AHA   Error Rate (%)
Other     261               14              5.1
AHA       1                 24              4.0

Page 18: RandomForests for Biomedical Applications

Salford and R

Different weighting schemes!

• R weights only take a weighted bootstrap sample

• Salford does weighted splits as well
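The two ideas named on this slide can be sketched in a few lines. This is an illustration of the general techniques only, not the code used by either the R package or the Salford software; the helper names (`case_weights`, `weighted_bootstrap`, `gini`) and the 80/20 class shares are assumptions chosen to match the cookbook example.

```python
import random
from collections import Counter


def case_weights(labels, class_share):
    """Spread each class's share of the total weight evenly over its cases."""
    counts = Counter(labels)
    return [class_share[y] / counts[y] for y in labels]


def weighted_bootstrap(labels, weights, rng):
    """Weighted bootstrap: draw n cases with replacement, proportional to weight."""
    idx = rng.choices(range(len(labels)), weights=weights, k=len(labels))
    return [labels[i] for i in idx]


def gini(labels, weights=None):
    """Gini impurity of a node; with weights, each case counts as its weight."""
    if weights is None:
        weights = [1.0] * len(labels)
    totals = Counter()
    for y, w in zip(labels, weights):
        totals[y] += w
    s = sum(totals.values())
    return 1.0 - sum((t / s) ** 2 for t in totals.values())


labels = ["Other"] * 275 + ["AHA"] * 25
w = case_weights(labels, {"AHA": 0.8, "Other": 0.2})

# Weighting the sample changes which cases a tree sees.
sample = weighted_bootstrap(labels, w, random.Random(0))
print("AHA share of bootstrap sample:", sample.count("AHA") / len(sample))

# Weighting the split criterion changes how impure the same node looks.
print("unweighted Gini:", gini(labels))
print("weighted Gini:  ", gini(labels, w))
```

In the first approach the weights only affect the data each tree is grown on; in the second they also enter the impurity used to choose splits, which is one plausible route by which the two schemes could rank variables differently.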

Page 19: RandomForests for Biomedical Applications

R Weights

[Figure: importance plotted against variable number]

Page 20: RandomForests for Biomedical Applications

Important Variables (R)

• Fat (g)
• Monounsaturated fat (g)
• Saturated fat (g)
• Sodium (mg)
• Polyunsaturated fat (g)

For all weights!

Page 21: RandomForests for Biomedical Applications

Salford Weights

[Figure: importance plotted against variable number]

Page 22: RandomForests for Biomedical Applications

Important Variables (Salford)

• Carb (g)
• Polyunsaturated fat (g)
• Caffeine (mg)
• Cholesterol (mg)
• Fiber (g)
• Protein (g)
• Trans fat (g)
• Fat (g)

Page 23: RandomForests for Biomedical Applications

Example 2: Memory

Page 24: RandomForests for Biomedical Applications

Memory: Predict survival

Error rate 28.2%

Actual      Predicted Survived   Predicted Died   Error Rate (%)
Survived    839                  591              41
Died        359                  1584             18

Page 25: RandomForests for Biomedical Applications

Memory: Predict dementia?

Error rate 28.1%

Actual      Predicted Normal   Predicted Demented   Error Rate (%)
Normal      2410               24                   0.99
Demented    926                13                   98.62
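One way to read this table is to compare the forest with the trivial rule that labels everyone "Normal". A small sketch using only the counts above (the variable names are illustrative):

```python
# Row totals from the dementia table: actual Normal and actual Demented.
normal = 2410 + 24
demented = 926 + 13
total = normal + demented

# Forest errors: Normal predicted as Demented plus Demented predicted as Normal.
forest_error = 100 * (24 + 926) / total

# Trivial baseline: always predict "Normal", so every Demented case is missed.
baseline_error = 100 * demented / total

print(f"forest error:   {forest_error:.2f}%")
print(f"baseline error: {baseline_error:.2f}%")
```

The two rates are nearly identical, which is consistent with the forest assigning almost every person to the majority class and helps explain why class weights are tried next.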

Page 26: RandomForests for Biomedical Applications

Class Weights

30% weight Normal, 70% weight Demented
Error rate 38%

Actual      Predicted Normal   Predicted Demented   Error Rate (%)
Normal      1646               788                  32
Demented    508                431                  54

Page 27: RandomForests for Biomedical Applications

Salford Weights

[Figure: importance plotted against variable number]

Page 28: RandomForests for Biomedical Applications

R Weights

[Figure: importance plotted against variable number]

Page 29: RandomForests for Biomedical Applications

Salford Weights

[Figure: variable importance plot]

Page 30: RandomForests for Biomedical Applications

R Weights

[Figure: variable importance plot]

Page 31: RandomForests for Biomedical Applications

Summary

• R weights only take a weighted bootstrap sample
• Salford does weighted splits as well
• Salford weights can give different variable importance

Page 32: RandomForests for Biomedical Applications

Archetypes

Cutler and Breiman, Technometrics, 1994

• Unsupervised learning, alternative to cluster analysis or PCA
• Summarize data using a fixed number of “archetypes”
• The archetypes are extremes
• Data points are approximated by mixtures of archetypes
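In symbols, the formulation from the cited Technometrics paper can be summarized as follows. The notation is chosen here for illustration: x_1, ..., x_n are the data points, z_1, ..., z_p the archetypes, and α and β the mixture weights.

```latex
\begin{aligned}
z_k &= \sum_{j=1}^{n} \beta_{kj}\, x_j, & \beta_{kj} &\ge 0, & \sum_{j=1}^{n} \beta_{kj} &= 1, \\
x_i &\approx \sum_{k=1}^{p} \alpha_{ik}\, z_k, & \alpha_{ik} &\ge 0, & \sum_{k=1}^{p} \alpha_{ik} &= 1, \\
\mathrm{RSS} &= \sum_{i=1}^{n} \Bigl\| x_i - \sum_{k=1}^{p} \alpha_{ik}\, z_k \Bigr\|^2 .
\end{aligned}
```

The archetypes are themselves mixtures of data points, and the weights α and β are chosen to minimize the residual sum of squares for a fixed number p of archetypes.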

Page 33: RandomForests for Biomedical Applications

Archetypes

Example 1 (cookbooks):
• Archetypes represent extreme recipes
• A particular recipe is approximated as a mixture of the extreme recipes

Example 2 (memory):
• Archetypes represent extreme dietary patterns
• A person’s diet is approximated as a mixture of the extreme diets
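To make "approximated as a mixture" concrete, the sketch below takes a set of archetypes as given and finds nonnegative weights summing to one that best reproduce a single point. It uses projected gradient descent, chosen here for brevity; it is not necessarily the algorithm used in the talk, and the toy archetypes and point are made up for illustration.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {a : a_k >= 0, sum(a) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for k, u_k in enumerate(u, start=1):
        cumulative += u_k
        t = (cumulative - 1.0) / k
        if u_k - t > 0:          # holds for a prefix of k; keep the last one
            theta = t
    return [max(x - theta, 0.0) for x in v]


def mixture_weights(x, archetypes, steps=2000):
    """Weights a (a_k >= 0, sum 1) minimizing ||x - sum_k a_k * z_k||^2."""
    p, d = len(archetypes), len(x)
    # Conservative step size: 1 / (2 * squared Frobenius norm of the archetypes).
    frob2 = sum(z_d * z_d for z in archetypes for z_d in z)
    lr = 1.0 / (2.0 * frob2)
    a = [1.0 / p] * p
    for _ in range(steps):
        approx = [sum(a[k] * archetypes[k][j] for k in range(p)) for j in range(d)]
        resid = [approx[j] - x[j] for j in range(d)]
        grad = [2.0 * sum(resid[j] * archetypes[k][j] for j in range(d))
                for k in range(p)]
        a = project_to_simplex([a[k] - lr * grad[k] for k in range(p)])
    return a


# Toy example: three hypothetical archetypes in two dimensions.
archetypes = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
point = (0.2, 0.3)
weights = mixture_weights(point, archetypes)
print([round(w, 3) for w in weights])
```

For this toy point the weights come out near 0.5, 0.2 and 0.3: it is half the first archetype, a fifth of the second, and three tenths of the third.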

Page 34: RandomForests for Biomedical Applications

Example 1: Cookbooks

?

Page 35: RandomForests for Biomedical Applications

Cookbooks: How many archetypes?

[Figure: RMSE plotted against number of archetypes]

Page 36: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 3]

Page 37: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 4]

Page 38: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 5]

Page 39: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Page 40: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 7]

Page 41: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 1

Page 42: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 2

Page 43: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 3

Page 44: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 4

Page 45: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 5

Page 46: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 6

Page 47: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 7

Page 48: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 8

Page 49: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 9

Page 50: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 10

Page 51: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 11

Page 52: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 6]

Cookbook 12

Page 53: RandomForests for Biomedical Applications

Example 2: Memory

Page 54: RandomForests for Biomedical Applications

Memory: How many archetypes?

[Figure: RMSE plotted against number of archetypes]

Page 55: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 7]

Color = Dementia Status

Page 56: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 7]

Color = Smoking Status

Page 57: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 7]

Color = Drinking Status

Page 58: RandomForests for Biomedical Applications

[Figure: plot with archetypes labeled 1 to 7]

Color = Age

Page 59: RandomForests for Biomedical Applications

Development

Random forests:
• Regression version
• Case weights
• Probability estimates
• Proximities
• Multivariate outcomes

Archetypes:
• Archetypal functions
• Archetypal sets