A project of “Imaging Regression, Classification and Clustering,”

a sub-WG of the Imaging WG

1

Active members:
• Ashish Mahabal
• Julian Faraway
• Jiayang Sun
• Grace Wang
• Xiaofeng Wang
• Lingsong Zhang

[Diagram: three data sources (Transient, Non-variable, Brighter objects) and three data types (Light curves, Statistics, Images)]

• Objectives
  – Classification, in real time, using minimal data
  – Impact in terms of larger pictures
• Challenges
  – Heterogeneity of data sources
    • CRTS numbers, LSST numbers, minimal overlap (quote DASCH), part of larger set of parameters
  – Large and massive amount of light curves
  – Missing data, measurement errors and irregularly sampled data, …
• Data Sets
  – 3 data sources and 2.67 types (light curves, stats, images)

2

Presenter

Presentation Notes

----- Meeting Notes (5/20/13 17:06) -----
[1] Larger picture in the next couple of slides
[2] Heterogeneity: CRTS, LSST (fainter than CRTS), DASCH (brighter)
[3] Massive: > 500m+ light curves
[4] Missing data; measurement errors are heteroscedastic; in CRTS, (4 images in 30 s)

Data

1. Transients from CRTS (3 types)
2. Mostly non-variables: objects at random locations, also used by the Astro group (2 types)
3. Brighter samples of CVs and RR Lyrae, important for connecting datasets (e.g. many brighter CRTS objects will saturate LSST, just as almost all DASCH sources are saturated in CRTS) [some are transients and some are periodic]

3

Presenter

Presentation Notes

RR Lyrae are periodic

What is a transient?

One that has a large brightness change (delta-magnitude) within a short timespan (small delta-time)

[Figure: fast transient (flaring dM), CSS080118:112149–131310. Panels: 4 individual exposures, separated by 10 min; light curve]
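The definition of a transient on this slide can be turned into a simple check. The sketch below is only an illustration, not the CRTS detection pipeline: the function name and both thresholds (`min_delta_mag`, `max_delta_time_days`) are assumptions chosen for the example.

```python
def is_transient_candidate(times_days, mags,
                           min_delta_mag=1.0, max_delta_time_days=1.0):
    """Flag a light curve with a large magnitude change in a short timespan.

    Both thresholds are hypothetical values for illustration only.
    """
    obs = sorted(zip(times_days, mags))
    for i, (t_i, m_i) in enumerate(obs):
        for t_j, m_j in obs[i + 1:]:
            if t_j - t_i > max_delta_time_days:
                break  # later observations are even further away in time
            if abs(m_j - m_i) >= min_delta_mag:
                return True
    return False


# Four exposures separated by 10 minutes (expressed in days), one flaring.
ten_min = 10 / (24 * 60)
times = [0.0, ten_min, 2 * ten_min, 3 * ten_min]
print(is_transient_candidate(times, [18.0, 16.5, 17.0, 17.6]))
```

A slow change of the same size spread over months would not be flagged, because no pair of observations falls inside the time window.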

4

Presenter

Presentation Notes

Object in the central square is the transient. Other objects in the 4 images are seen not to change their brightness

Data Characteristics

Classifying (all) transients (in real time) is hard
• Too many ‘ordinary’ transients
  – Finding needles in a haystack
• Too many possible ‘parameters’
  – e.g. colors, positions, flux = # photons

CRTS --> LSST

5

Presenter

Presentation Notes

Finding large numbers of transients has become possible only in the last few years. The numbers will ramp up dramatically (e.g. LSST, SKA = Square Kilometre Array). SKA will be a survey at radio wavelengths.

[Figure: Variability Tree. Credit: L. Eyer & N. Mowlavi (03/2009), updated 04/2013. The tree divides variable objects (asteroids, stars, AGN) into extrinsic variability (rotation, eclipse, microlensing) and intrinsic variability (eruptive, pulsation, secular, cataclysmic). Leaves name individual classes, for example eclipsing binaries (EA, EB, EW), planetary transits, RR Lyrae, δ Cepheids, δ Scuti, Miras, dwarf novae (UG), novae, supernovae (SN Ia; SN II, Ib, Ic), and radio-quiet and radio-loud AGN including blazars.]

Challenge 1: Characterize/Classify as much with as little data as possible

We concentrate here on lightcurves (time series)

6

Presenter

Presentation Notes

Some changes inside the node, extrinsic: Ashish please add …. Color notes + intrinsic vs. extrinsic. Intrinsic variation is when something happens inside the object; extrinsic variation is due to some movement (e.g. eclipse or rotation). AGNs are Active Galactic Nuclei. Blazars are radio-loud (they emit more energy at radio wavelengths). Those two categories and supernovae are extragalactic. The Cataclysmic Variables (CVs) and RR Lyrae that we see are from within our own Galaxy.

Challenge 2: Only a small fraction is rare*

• Current Status:
  – About 1 strong (but mostly ‘ordinary’) transient per 10^6 sources, by machine
  – High threshold to pick the most dramatic transients (identification by human)
• Future:
  – With LSST, a million transients will be found per night, which is why we need automatic classification algorithms

CRTS statistics as of May 2013: http://nesssi.cacr.caltech.edu/catalina/Stats.html

Tel      All OTs   SNe   CVs  Blazars  Ast/flares  CV/SN   AGN  Other
CSS         3390  1003   676      216         269    436   438    438
MLS         3387   479    69       75         225    600  1597    522
SSS          680    99   251       17          11    108    32    167
SNHunt       186   186     0        0           0      0     0      0
Total       7643  1767   966      308         505   1144  2067   1127

7

Presenter

Presentation Notes

updated

A Blazar: a variability-based counterpart of a previously unidentified Fermi source.

8

Example: Blazar - its type was confirmed by the spectrum graph on the right

Presenter

Presentation Notes

Blazar lightcurve. “High state” simply means its bright state. The right figure is the spectrum that confirms the object to be a blazar (mainly blue continuum).

Challenge 3: A Variety of Parameters
• Discovery: magnitudes, delta-magnitudes
• Contextual:
  – Distance to nearest star
  – Magnitude of the star
  – Color of that star
  – Normalized distance to nearest galaxy
  – Distance to nearest radio source
  – Flux of nearest radio source
  – Galactic latitude
• Follow-up
  – Colors (g-r, r-i, i-z etc.)
• Prior classifications (event type)
• Characteristics from light-curve
  – Amplitude
  – Median buffer range percentage
  – Standard deviation
  – Stetson K
  – Flux percentile ratio mid80
  – Prior outburst statistic

Not all parameters are always present, leading to Swiss-cheese-like data sets

http://ki-media.blogspot.com/

New lightcurve-based parameters:
• Whole curve measures
• Fitted curve measures
• Residual from fit measures
• Cluster measures
• Other
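As an illustration of the light-curve characteristics listed above, here is a minimal sketch computing three of them. It makes several assumptions: amplitude is taken as half the max-minus-min magnitude range, Stetson K uses the standard single-band form with an inverse-variance weighted mean, and the function name is hypothetical. It is not the group's code.

```python
import math
import statistics


def light_curve_characteristics(mags, errs):
    """Amplitude, standard deviation and Stetson K for one light curve.

    mags: observed magnitudes; errs: their (known) measurement errors.
    Definitions are assumptions stated in the text above this block.
    """
    n = len(mags)
    weights = [1.0 / e ** 2 for e in errs]
    wmean = sum(w * m for w, m in zip(weights, mags)) / sum(weights)
    # Normalized residuals: delta_i = sqrt(n / (n - 1)) * (m_i - mean) / err_i
    scale = math.sqrt(n / (n - 1))
    deltas = [scale * (m - wmean) / e for m, e in zip(mags, errs)]
    mean_abs = sum(abs(d) for d in deltas) / n
    rms = math.sqrt(sum(d * d for d in deltas) / n)
    return {
        "amplitude": (max(mags) - min(mags)) / 2.0,
        "std": statistics.stdev(mags),
        "stetson_k": mean_abs / rms,
    }


print(light_curve_characteristics([15.0, 15.2, 14.8, 15.1, 14.9], [0.1] * 5))
```

Stetson K is the ratio of the mean absolute normalized residual to its root mean square, so it always lies between 0 and 1; a curve dominated by a single outlier gives a small value.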

9

Challenge 4: Lightcurve demonstrating upper limits

10

Presenter

Presentation Notes

Red triangles indicate upper limits (observations done, but no object detected). Such truncated light curves have not yet been studied by our group; this is part of the near-future plan.

Our Approaches

Methods (recall our objective: Classification):
1. Modern EDA before classification on stats, lightcurves in 1-d and high-d (graphical computation, SiZer and PP)
2. Improvement from 4 directions:
   1. Better with new derived statistics
   2. Better classification procedure (single, ensemble)
   3. Better with previously ignored information: ‘semi-supervised’ learning
   4. Better in terms of using less or incremental approach
   Notes: Classification based on derived statistics or entire curve (2-4)
3. Methodology Development

11

EDA on Non-Variables

12

Presenter

Presentation Notes

Talk about variability in sample size and sample spacing. Talk about natural variation even in non-variables. Mention that some cases might even be variables. The gaps are indicative of times when an object is not seen from Earth. The clusters are annual clusters, with the data spanning about 8 years.

EDA on a transient (change is sudden)

13

Presenter

Presentation Notes

Blue line is the least squares fit. Point out how flares are noticeable in some cases but not so obvious in others.

EDA on a transient with changes that can take a few years

14

Presenter

Presentation Notes

Note the aperiodic pattern of variation

EDA on a sub-group: active galactic nuclei, which include blazars

15

Presenter

Presentation Notes

Active Galactic Nucleus examples. Note increasing magnitude. Won’t show examples of other types.

Derive new statistics

• How?
  – Fit curves (by FDA, NP, Gaussian process modeling)
    • FDA: registration
    • NP: incorporate known variances (we have a cute method)
    • GPM: use the known variances to build the prior
  – Residuals:
    • Variability, outliers/signals, …
  – Others

16

Modeling the Light Curves

Gaussian Process Regression

Can tweak:
• Smoothness
• Signal variance
• Error variance (unusually, this is known)

17

Presenter

Presentation Notes

Talk about general idea of modeling the light curves. GPR is just one approach that allows this. Talk about how the error variance is known in this case (which is unusual in Statistics). Talk about how GPR can use this information directly. Approach needs more work – currently using Lowess to model the curves.
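To make concrete the point that GPR can use known error variances directly, here is a minimal NumPy sketch. It assumes a squared-exponential kernel and a constant mean; `length_scale_days` (smoothness) and `signal_var` are hypothetical settings, and the function name is invented for the example. The per-observation error variances enter as the diagonal noise term instead of being estimated.

```python
import numpy as np


def gp_fit_predict(t, y, err_sd, t_new, length_scale_days=30.0, signal_var=1.0):
    """Gaussian process regression with known per-point error variances.

    Returns the posterior mean and variance of the latent curve at t_new.
    Kernel choice and hyperparameter values are illustrative assumptions.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    t_new = np.asarray(t_new, dtype=float)
    err_var = np.asarray(err_sd, dtype=float) ** 2

    def kernel(a, b):
        d = a[:, None] - b[None, :]
        return signal_var * np.exp(-0.5 * (d / length_scale_days) ** 2)

    mean = y.mean()                            # constant prior mean
    K = kernel(t, t) + np.diag(err_var)        # known, heteroscedastic noise
    K_star = kernel(t_new, t)
    alpha = np.linalg.solve(K, y - mean)
    pred_mean = mean + K_star @ alpha
    v = np.linalg.solve(K, K_star.T)
    pred_var = signal_var - np.sum(K_star * v.T, axis=1)
    return pred_mean, pred_var


t_obs = [0.0, 10.0, 20.0, 400.0]
mags = [17.0, 17.3, 17.1, 16.8]
errs = [0.1, 0.1, 0.5, 0.1]                    # larger error, less influence
m, v = gp_fit_predict(t_obs, mags, errs, [5.0, 1000.0])
print(m, v)
```

Far from any observation the prediction falls back to the prior mean with variance `signal_var`; near well-measured points the variance shrinks.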

Generation of new summary measures

[Diagram: the modeled curve yields fitted summary measures and, through its residuals, residual summary measures; clusters of observations (groups of 4 within 30 minutes) yield their own summary measures]

18

Presenter

Presentation Notes

Talk about how our new measures are based on the modeled curve. Measures are also developed from clusters of measurements taken in short bursts.

New Summary Statistics

1. Whole curve measures
   Median magnitude (mag); mean of absolute differences of successive observed magnitudes; the maximum difference in magnitudes
2. Fitted curve measures
   Total variation scaled by number of days of observation; range of fitted curve; maximum derivative in the fitted curve
3. Residual from fit measures
   The maximum studentized residual; SD of residuals; skewness of residuals; Shapiro-Wilk statistic of residuals
4. Cluster measures
   Fit the means within the groups (up to 4 measurements), then take the logged SD of the residuals from this fit; the max absolute residual from this fit; total variation of curve based on group means scaled by range of observation
5. Other

19
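The whole curve and cluster measures above can be sketched as follows. This is an illustrative reading rather than the group's implementation: the function names are hypothetical, "maximum difference" is taken as max minus min, and observations are assumed to belong to the same cluster when consecutive exposures are at most 30 minutes apart.

```python
import math
import statistics


def whole_curve_measures(mags):
    """Median magnitude and differences over the raw, time-ordered curve."""
    diffs = [abs(b - a) for a, b in zip(mags, mags[1:])]
    return {
        "median_mag": statistics.median(mags),
        "mean_abs_successive_diff": sum(diffs) / len(diffs),
        "max_diff": max(mags) - min(mags),   # assumed reading: max minus min
    }


def cluster_measures(times_days, mags, max_gap_days=30 / (24 * 60)):
    """Measures based on means within short bursts of observations."""
    obs = sorted(zip(times_days, mags))
    groups = [[obs[0]]]
    for prev, cur in zip(obs, obs[1:]):
        if cur[0] - prev[0] <= max_gap_days:
            groups[-1].append(cur)           # same burst
        else:
            groups.append([cur])             # start a new burst
    means, residuals = [], []
    for g in groups:
        g_mean = sum(m for _, m in g) / len(g)
        means.append(g_mean)
        residuals.extend(m - g_mean for _, m in g)
    sd = statistics.stdev(residuals) if len(residuals) > 1 else 0.0
    total_variation = sum(abs(b - a) for a, b in zip(means, means[1:]))
    span = obs[-1][0] - obs[0][0]
    return {
        "n_groups": len(groups),
        "log_sd_residuals": math.log(sd) if sd > 0 else float("-inf"),
        "max_abs_residual": max(abs(r) for r in residuals),
        "scaled_total_variation": total_variation / span if span > 0 else 0.0,
    }


step = 10 / (24 * 60)                        # 10 minutes in days
times = [0, step, 2 * step, 3 * step, 1, 1 + step, 1 + 2 * step, 1 + 3 * step]
mags = [15.0, 15.2, 15.0, 15.2, 16.0, 16.0, 16.0, 16.0]
print(whole_curve_measures(mags))
print(cluster_measures(times, mags))
```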

24

High dimensional views via modern graphics and PP

Available Data with non-variables and 7 transient types

Random split: Training Set N=2480, Test Set N=1240

Percent correctly classified in the test set:

Method   Richards   Richards+New
LDA            63             76
Tree           69             74
SVM            76             82

Others: Multinomial logistic DA + New Ensembles

25

Presenter

Presentation Notes

Talk about. Random split of data. Same methods used for each set of measures. Our measures have been added to the Richards measures. Talk about how the same default settings have been used on the classification procedures and that we expect we could optimize these to obtain even better performance.
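The evaluation protocol described in these notes (one random split, then percent correctly classified on the test set) can be sketched as below. The helper names and the seed are assumptions, and no classifier is fitted here; the labels in the last line are made up purely to show the accuracy calculation.

```python
import random


def random_split(n_items, n_train, seed=0):
    """Shuffle item indices once and cut them into training and test parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


def percent_correct(true_labels, predicted_labels):
    """Percentage of test objects whose predicted class matches the truth."""
    hits = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * hits / len(true_labels)


# Sizes from the slide: 2480 training and 1240 test light curves.
train_idx, test_idx = random_split(2480 + 1240, 2480)
print(len(train_idx), len(test_idx))

# Toy labels, only to demonstrate the metric.
print(percent_correct(["SN", "CV", "SN", "AGN"], ["SN", "CV", "CV", "AGN"]))
```

Using the same split for every classifier and every feature set, as the notes describe, keeps the comparison between "Richards" and "Richards+New" like for like.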

VdCC: another approach to incorporate known variances

26

• Idea:

Whole Curve Comparisons

• PfClust -> PfClassification
• Functional Centroid Method (FCC)
  – Model m(x) and of the whole curves for each class
  – Develop SCB for each m(x)
  – Define a functional distance measure between curves
  – Classify a new curve to one of the existing classes or a new class of curves based on the distances
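The centroid idea in the last two bullets can be illustrated with a deliberately simplified sketch. It assumes all curves have already been evaluated on a common grid, uses a root-mean-square distance as a stand-in for the functional distance, and takes a hypothetical `new_class_threshold`; the SCB step is not implemented. None of these choices come from the slides.

```python
import math


def mean_curve(curves):
    """Pointwise mean of curves sampled on a common grid (estimate of m(x))."""
    n = len(curves)
    return [sum(c[i] for c in curves) / n for i in range(len(curves[0]))]


def rms_distance(a, b):
    """Root-mean-square distance between two curves on the same grid."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def classify_curve(new_curve, class_curves, new_class_threshold):
    """Assign to the nearest class centroid, or to a new class if too far."""
    centroids = {label: mean_curve(cs) for label, cs in class_curves.items()}
    label, dist = min(
        ((lab, rms_distance(new_curve, c)) for lab, c in centroids.items()),
        key=lambda pair: pair[1],
    )
    return (label if dist <= new_class_threshold else "new class"), dist


training = {
    "flat": [[0.0, 0.0, 0.0], [0.2, 0.2, 0.2]],
    "rising": [[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]],
}
print(classify_curve([0.0, 0.9, 2.1], training, new_class_threshold=0.5))
print(classify_curve([10.0, 10.0, 10.0], training, new_class_threshold=0.5))
```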

27

Development of Functional Method

Exploration Step: are they different and separable?

• Directly estimate the (pair-wise) mean difference between classes

• Bootstrap method to estimate the (point-wise) confidence intervals.
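A minimal sketch of this exploration step follows, under stated assumptions: curves are resampled whole within each class, intervals are plain percentile intervals at each grid point, and the function name, `n_boot` and `seed` are hypothetical. The slides do not say which bootstrap variant the group used.

```python
import random


def bootstrap_mean_difference(curves_a, curves_b, n_boot=1000, level=0.95, seed=0):
    """Pointwise percentile-bootstrap intervals for mean(A) - mean(B).

    curves_a, curves_b: lists of curves sampled on the same grid.
    Returns (observed difference, lower bounds, upper bounds).
    """
    rng = random.Random(seed)
    n_points = len(curves_a[0])

    def mean_curve(curves):
        return [sum(c[j] for c in curves) / len(curves) for j in range(n_points)]

    def diff(ca, cb):
        return [a - b for a, b in zip(mean_curve(ca), mean_curve(cb))]

    observed = diff(curves_a, curves_b)
    boot = []
    for _ in range(n_boot):
        resampled_a = [rng.choice(curves_a) for _ in curves_a]
        resampled_b = [rng.choice(curves_b) for _ in curves_b]
        boot.append(diff(resampled_a, resampled_b))

    tail = (1.0 - level) / 2.0
    lower, upper = [], []
    for j in range(n_points):
        column = sorted(b[j] for b in boot)
        lower.append(column[int(tail * (n_boot - 1))])
        upper.append(column[int((1.0 - tail) * (n_boot - 1))])
    return observed, lower, upper


class_a = [[0.1, 5.0], [-0.1, 5.2], [0.0, 4.8], [0.05, 5.1]]
class_b = [[0.0, 0.1], [0.1, -0.1], [-0.1, 0.0], [0.0, 0.05]]
obs, lo, hi = bootstrap_mean_difference(class_a, class_b)
print(obs, lo, hi)
```

Grid points where the interval excludes zero are where the two classes appear separable; these are pointwise intervals, so they do not control error across the whole curve.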

28

Selective comparison

29


30


31

Conclusion
• Developed new derived statistics
• Applied better modeling/classification procedures
  – GRM, …
• Moved on 5 methodology development directions:
  – PfClassification, Functional CuvClass, VdCC, Ensemble, Scale-space Comparison
• Allowed for incremental classification

Our team had a great time working together and expects continued research and ties that will contribute to Statistics and sciences beyond this light curve analysis.

A revision of this talk with additional new work will be presented at the JSM ☺

32
