Modeling the Human Classification of Galaxy Morphology
Wednesday, December 5, 2007
Mike Specian
Galaxy Zoo Statistics
• Site announced on July 15, 2007
• Over 50,000 volunteers within the first week
• Most galaxies classified 10 times or more
• More classifications = better data
• Probably the world's most robust morphology database, with millions of objects classified
Data Preprocessing
• 1, 11 = Elliptical
• 2, 12 = Clockwise Spiral
• 3, 13 = Counterclockwise Spiral
• 4, 14 = Other (Edge-On Spiral)
• 5, 15 = Star / Don't-Know
• 6, 16 = Galaxy Merger
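The code list above can be decoded with a small lookup. This is a minimal sketch, not the preprocessing actually used: the names `MORPHOLOGY_BY_DIGIT` and `decode_vote` are hypothetical, and the only assumption drawn from the slide is that each pair of codes (for example 2 and 12) shares a last digit and a label. The slide does not say what distinguishes the two codes in a pair.

```python
# Hypothetical lookup built from the slide's code list.
MORPHOLOGY_BY_DIGIT = {
    1: "Elliptical",
    2: "Clockwise Spiral",
    3: "Counterclockwise Spiral",
    4: "Other (Edge-On Spiral)",
    5: "Star / Don't-Know",
    6: "Galaxy Merger",
}


def decode_vote(code: int) -> str:
    """Map a raw vote code (1-6 or 11-16) to its morphology label."""
    if code not in range(1, 7) and code not in range(11, 17):
        raise ValueError(f"unknown vote code: {code}")
    # Both codes in a pair share the same last digit.
    return MORPHOLOGY_BY_DIGIT[code % 10]


print(decode_vote(12))
```

Rejecting codes outside the two listed ranges keeps a stray value such as 7 or 21 from being silently mapped to a real class.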
How People Voted

Type                 Number Classified
Elliptical           666,679
Spiral               94,429
Other (Edge-On)      112,148
Star / Don't Know    23,735
Galaxy Merger        11,846
There’s almost too much data!
Limiting the sample:
1. Model on 10,000 objects
2. Distinguish only between 'Elliptical' and 'Spiral'
3. Accept objects that received >= 60% of the total vote
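The three sampling rules above can be sketched as a filter over per-object vote lists. This is an illustration under assumptions, not the original pipeline: the function names, the input layout (object id mapped to a list of label strings), and the reading of rule 3 as "the winning class holds at least 60% of all votes cast for that object" are hypothetical. It also assumes clockwise and counterclockwise votes have already been merged into 'Spiral'.

```python
from collections import Counter


def majority_label(votes, threshold=0.60, allowed=("Elliptical", "Spiral")):
    """Return the top-voted label if it is allowed and reaches the threshold."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    if label in allowed and count / len(votes) >= threshold:
        return label
    return None


def select_sample(objects, max_objects=10_000):
    """Keep at most max_objects objects that pass the vote filter."""
    sample = {}
    for object_id, votes in objects.items():
        label = majority_label(votes)
        if label is not None:
            sample[object_id] = label
        if len(sample) == max_objects:
            break
    return sample


# Made-up votes for three objects, for illustration only.
demo = {
    "a": ["Elliptical"] * 7 + ["Spiral"] * 3,          # 70% Elliptical: kept
    "b": ["Spiral"] * 5 + ["Elliptical"] * 5,          # 50/50: rejected
    "c": ["Galaxy Merger"] * 9 + ["Spiral"],           # wrong class: rejected
}
print(select_sample(demo))
```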
Two Data Sets

Set 1
Only contains information that human eyes could use to distinguish morphology. (30 attributes)
Examples: Petrosian flux, Petrosian radius, radius containing 50% and 90% of Petrosian flux, adaptive shape measures, de Vaucouleurs fits, exponential fits

Set 2
Contains additional information likely correlated with morphology, but to which human eyes on Galaxy Zoo do not have access. (71 attributes)
Examples: Light polarization (Stokes parameters), de Vaucouleurs magnitude fits, dereddened magnitudes, redshift

For Set 1, all categories are measured in the telescope's three visible color filters. For Set 2, all categories except redshift are measured in all 5 filters.
Feature data pulled from Sloan Digital Sky Survey Data Release 6
How many trees in an ideal random forest?
[Figure: Accuracy vs. Number of Trees in Random Forest for Abbreviated Galaxy Zoo Data Set. x-axis: Number of Trees (0 to 200); y-axis: Accuracy (86 to 92).]
Accuracies above trained on 2179 instances, ~50/50 spiral/elliptical, 66% holdout
Probing Learning Rate and Momentum in ANN
Momentum
Learning Rate
       .10     .15     .20     .25     .30
.20    83.00   82.86   82.86   82.86   82.86
.25    82.86   82.19   82.59   82.59   82.46
.30    82.59   82.73   82.59   81.51   83.27
.35    82.46   82.86   83.00   82.59   82.05
.40    83.40   82.59   81.92   81.78   81.65
Accuracies above trained on 2179 instances, ~50/50 spiral/elliptical, 66% holdout
4.16.82 To 3 Sigma ->
Quantifying Estimator Error

Number of Folds    Accuracy
2                  95.20
4                  95.54
6                  95.59
8                  95.46
10                 95.46
12                 95.70
14                 95.75
16                 95.61
18                 95.54
20                 95.68
Example taken from Random Forest, Data Set 2, 15 Trees
Average = 95.6
Standard Deviation = 0.158
All errors taken to 3 sigma.
Error = 95.6 ± 0.5
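The arithmetic behind the quoted error can be checked directly from the ten accuracies in the table. This is a minimal sketch using only the standard library; the variable names are illustrative. The sample standard deviation (n - 1 denominator) reproduces the slide's 0.158, whereas the population form would give about 0.149, so the sample form is presumably what was used.

```python
import statistics

# Accuracies for 2, 4, ..., 20 folds, copied from the table above.
fold_accuracies = [95.20, 95.54, 95.59, 95.46, 95.46,
                   95.70, 95.75, 95.61, 95.54, 95.68]

mean_accuracy = statistics.mean(fold_accuracies)
sample_std = statistics.stdev(fold_accuracies)  # n - 1 denominator
three_sigma = 3 * sample_std

print(f"std = {sample_std:.3f}")                          # std = 0.158
print(f"{mean_accuracy:.1f} ± {three_sigma:.1f}")         # 95.6 ± 0.5
```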
Conclusions
• Naïve Bayes is not the way to go.
• Random Forests, ANN, and SVM all have small variances and high accuracies.
• Spirals are harder to identify (need more training instances, or has human bias taken over?).
• Including information beyond what the human eye can see is, remarkably, helpful.