Modeling the Human Classification of Galaxy Morphology

Modeling the Human Classification of Galaxy Morphology Wednesday, December 5, 2007 Mike Specian


Page 1: Modeling the Human Classification of Galaxy Morphology

Modeling the Human Classification of Galaxy Morphology

Wednesday, December 5, 2007
Mike Specian

Page 4: Modeling the Human Classification of Galaxy Morphology

Galaxy Zoo Statistics

• Site announced on July 15, 2007
• Over 50,000 volunteers within first week
• Most galaxies classified 10 times or more
• More classifications = better data
• Probably world’s most robust morphology database with millions of objects classified

Page 5: Modeling the Human Classification of Galaxy Morphology

Data Preprocessing

Page 6: Modeling the Human Classification of Galaxy Morphology

Data Preprocessing

• 1, 11 = Elliptical
• 2, 12 = Clockwise Spiral
• 3, 13 = Counterclockwise Spiral
• 4, 14 = Other (Edge-On Spiral)
• 5, 15 = Star / Don’t-Know
• 6, 16 = Galaxy Merger
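The code-to-class mapping above could be applied to raw votes roughly as follows. This is a minimal sketch, not the author's preprocessing script: the names `CLASS_BY_DIGIT` and `decode_vote` are hypothetical, and the only rule it uses is the one the list states, namely that a code and the same code plus 10 denote the same class.

```python
# Class names keyed by the last digit of the raw vote code, as listed above.
CLASS_BY_DIGIT = {
    1: "Elliptical",
    2: "Clockwise Spiral",
    3: "Counterclockwise Spiral",
    4: "Other (Edge-On Spiral)",
    5: "Star / Don't-Know",
    6: "Galaxy Merger",
}


def decode_vote(code: int) -> str:
    """Map a raw vote code (1-6 or 11-16) to its class name."""
    if code not in range(1, 7) and code not in range(11, 17):
        raise ValueError(f"unknown vote code: {code}")
    # Codes n and n + 10 share a class, so only the last digit matters.
    return CLASS_BY_DIGIT[code % 10]


print(decode_vote(2), "|", decode_vote(12))
```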

Page 7: Modeling the Human Classification of Galaxy Morphology

How People Voted

Type                 Number Classified
Elliptical           666,679
Spiral               94,429
Other (Edge-On)      112,148
Star / Don’t Know    23,735
Galaxy Merger        11,846

There’s almost too much data!

Limiting the sample:
1. Model on 10,000 objects
2. Distinguish only between ‘Elliptical’ and ‘Spiral’
3. Accept objects that received >= 60% of the total vote
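The selection rule in steps 2 and 3 can be sketched as below. This is an illustration under stated assumptions rather than the author's code: the function name `consensus_label` is hypothetical, and merging clockwise and counterclockwise votes into a single ‘Spiral’ class is an assumption suggested by the single ‘Spiral’ row in the vote table.

```python
from collections import Counter
from typing import Optional

# Assumption: both spiral directions count toward one 'Spiral' class.
SPIRAL_CLASSES = {"Clockwise Spiral", "Counterclockwise Spiral"}


def consensus_label(votes: list[str], threshold: float = 0.60) -> Optional[str]:
    """Return 'Elliptical' or 'Spiral' if that class holds >= threshold of all votes."""
    if not votes:
        return None
    merged = ["Spiral" if v in SPIRAL_CLASSES else v for v in votes]
    counts = Counter(merged)
    # Only the two target classes are eligible; everything else is rejected.
    for label in ("Elliptical", "Spiral"):
        if counts[label] / len(merged) >= threshold:
            return label
    return None


votes = ["Clockwise Spiral"] * 5 + ["Counterclockwise Spiral"] * 2 + ["Elliptical"] * 3
print(consensus_label(votes))
```

Objects for which the function returns `None` would simply be left out of the modeling sample.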

Page 8: Modeling the Human Classification of Galaxy Morphology

Two Data Sets

Set 1: Contains only information that human eyes could use to distinguish morphology. (30 attributes)

Examples: Petrosian flux, Petrosian radius, radii containing 50% and 90% of Petrosian flux, adaptive shape measures, de Vaucouleurs fits, exponential fits

Set 2: Contains additional information likely correlated with morphology, but to which human eyes on Galaxy Zoo do not have access. (71 attributes)

Examples: Light polarization (Stokes parameters), de Vaucouleurs magnitude fits, dereddened magnitudes, redshift

For Set 1, all categories are measured in the telescope’s three visible color filters. For Set 2, all categories except redshift are measured in all five filters.

Feature data pulled from Sloan Digital Sky Survey Data Release 6

Page 9: Modeling the Human Classification of Galaxy Morphology

How many trees in an ideal random forest?

[Figure: Accuracy vs. Number of Trees in Random Forest for Abbreviated Galaxy Zoo Data Set. x-axis: Number of Trees (0 to 200); y-axis: Accuracy (86 to 92)]

Accuracies above trained on 2179 instances, ~50/50 spiral/elliptical, 66% holdout

Page 10: Modeling the Human Classification of Galaxy Morphology

Probing Learning Rate and Momentum in ANN

Accuracy by Learning Rate (rows) and Momentum (columns)

Learning Rate    .10      .15      .20      .25      .30
.20              83.00    82.86    82.86    82.86    82.86
.25              82.86    82.19    82.59    82.59    82.46
.30              82.59    82.73    82.59    81.51    83.27
.35              82.46    82.86    83.00    82.59    82.05
.40              83.40    82.59    81.92    81.78    81.65

Accuracies above trained on 2179 instances, ~50/50 spiral/elliptical, 66% holdout

82.6 ± 1.4 (to 3 sigma)
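One way to read the grid above is to ask whether any learning rate and momentum pair stands out from the rest. The sketch below copies the 25 accuracies from the slide and summarizes them. The assignment of rows to learning rate and columns to momentum is an assumption about the original layout, and the variable names are hypothetical.

```python
from statistics import mean, stdev

# Assumed layout: columns are momentum values, rows are learning rates.
MOMENTA = [0.10, 0.15, 0.20, 0.25, 0.30]
GRID = {  # learning rate -> accuracy (%) at each momentum value
    0.20: [83.00, 82.86, 82.86, 82.86, 82.86],
    0.25: [82.86, 82.19, 82.59, 82.59, 82.46],
    0.30: [82.59, 82.73, 82.59, 81.51, 83.27],
    0.35: [82.46, 82.86, 83.00, 82.59, 82.05],
    0.40: [83.40, 82.59, 81.92, 81.78, 81.65],
}

# Flatten to (accuracy, learning rate, momentum) triples.
cells = [
    (acc, rate, momentum)
    for rate, row in GRID.items()
    for momentum, acc in zip(MOMENTA, row)
]
accuracies = [acc for acc, _, _ in cells]

mu = mean(accuracies)
three_sigma = 3 * stdev(accuracies)  # sample standard deviation
best = max(cells)                    # cell with the highest accuracy
within = sum(abs(acc - mu) <= three_sigma for acc in accuracies)

print(f"mean = {mu:.1f}, 3 sigma = {three_sigma:.1f}")
print(f"best cell: accuracy {best[0]} at rate {best[1]}, momentum {best[2]}")
print(f"{within} of {len(accuracies)} cells lie within 3 sigma of the mean")
```

If every cell falls inside the 3 sigma band, the grid gives little reason to prefer one setting over another.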

Page 12: Modeling the Human Classification of Galaxy Morphology

Quantifying Estimator Error

Number of Folds    Accuracy
2                  95.20
4                  95.54
6                  95.59
8                  95.46
10                 95.46
12                 95.70
14                 95.75
16                 95.61
18                 95.54
20                 95.68

Example taken from Random Forest, Data Set 2, 15 Trees

Average = 95.6
Standard Deviation = 0.158

All errors taken to 3 sigma.

Accuracy = 95.6 ± 0.5
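The error estimate on this slide can be reproduced from the fold table: take the mean of the ten accuracies and use three sample standard deviations as the error bar. The sketch below uses only the numbers shown above. The helper name `accuracy_with_error` is hypothetical, and the choice of the sample (n - 1) standard deviation is an assumption that happens to match the quoted 0.158.

```python
from statistics import mean, stdev

# Accuracy (%) by number of cross-validation folds, copied from the table above.
FOLD_ACCURACY = {
    2: 95.20, 4: 95.54, 6: 95.59, 8: 95.46, 10: 95.46,
    12: 95.70, 14: 95.75, 16: 95.61, 18: 95.54, 20: 95.68,
}


def accuracy_with_error(accuracies, n_sigma=3.0):
    """Return (mean, n_sigma * sample standard deviation) of the accuracies."""
    values = list(accuracies)
    return mean(values), n_sigma * stdev(values)


mu, err = accuracy_with_error(FOLD_ACCURACY.values())
sigma = stdev(FOLD_ACCURACY.values())
print(f"average = {mu:.1f}, standard deviation = {sigma:.3f}")
print(f"accuracy = {mu:.1f} ± {err:.1f}")
```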

Page 14: Modeling the Human Classification of Galaxy Morphology

Conclusions

• Naïve Bayes is not the way to go.
• Random Forests, ANN, and SVM all have small variances, high accuracies
• Spirals harder to identify (need more training instances, or has human bias taken over?)
• Including information beyond what the human eye can see is, remarkably, helpful.