
Finding outliers in multivariate data with measurement errors

Somak Raychaudhury, School of Physics and Astronomy

Jianyong Sun and Ata Kabán, School of Computer Science

University of Birmingham, UK

Publications

• Robust Mixtures in the Presence of Measurement Errors, Jianyong Sun, Ata Kabán and Somak Raychaudhury, International Conference on Machine Learning (ICML07), Corvallis, OR, June 2007 (astro-ph/0709.0928)

• Robust Visual Mining of Data with Error Information, Jianyong Sun, Ata Kabán and Somak Raychaudhury, European Conference on Machine Learning/Practice of Knowledge Discovery in Databases (ECML/PKDD07), Warsaw, Sep 2007

Designer Algorithms® for Astronomy

Physics and Astronomy Computer Science

Jianyong Sun

Somak Raychaudhury

Trevor Ponman

Ian Stevens

Bill Chaplin

Louisa Nolan

Xin Yao

Peter Tino

Ata Kaban

Ela Claridge

Richard Dearden

• Started 2004
• Algorithm development involving Machine learning (ANN, SVM, Gaussian processes), latent variables (e.g. GTM), Bayesian methods (e.g. ICA) and Genetic algorithms (model finding and fitting)
• Emphasis on algorithm development, not applying existing software
• 1 PhD thesis complete: Kernel regression for time delays in gravitational lenses; four other PhD students involved
• 12 refereed publications, including ICML, ECML/PKDD, SDM, MNRAS, A&A

www.sr.bham.ac.uk/algorithms

Parameter Space of Observables

[Figure: axes of the parameter space: RA, Dec; Energy, Spectrum; Time; Flux at 1; Flux at 2; Shape parameters]

Along each axis the measurements are characterized by the position, extent, sampling and resolution. All astronomical measurements span some volume in this parameter space.

An example: SDSS quasar catalog

Courtesy: http://astrostatistics.psu.edu/ and Schneider et al 2005

• Integers, real numbers, characters
• Continuous and discrete variables
• Logarithmic or linear variables
• Upper and lower limits
• Infinite range, or bounded above or below
• Missing data
• Errors!

An Example: Discoveries of High-Redshift Quasars and Type-2 Quasars in DPOSS

High-z QSO

Type-2 QSO

Djorgovski et al.

Somak Raychaudhury ADASS07

A Galilean Dialogue

Astronomer: I have a list of objects with some measured parameters, and I’d like to discover unusual objects from them.

Computer Scientist: All right, that’s a straightforward outlier detection problem.

Astronomer: OK, here’s a list of quasars with four colours from SDSS.

(Three weeks later)

Computer Scientist: Here is the list of top outlying objects.

Astronomer: Ah, of course, many of these have large measurement errors. They might not be genuine outliers.

Computer Scientist: Error? What’s an error?

Astronomer: Erm, they are the uncertainties on the measurements.

Computer Scientist: You mean these are mistakes? Surely you could have been more careful!

Astronomer: Oh, no, no… The errors depend on instrumental limitations and observational conditions, and are unavoidable with physical measurements.

Computer Scientist: Um… OK, then give us your measurement errors.


Finding outliers

Robust density modelling aims at capturing the structure of typical observations while dealing with outliers

• Required to avoid biases of parameter estimates (here outliers need to be thrown out)

• Sometimes peculiar objects are of interest, for identifying candidates of possibly new kinds of objects (e.g. from archives of multi-wavelength astronomical images) that deserve follow-up study (e.g. using spectroscopy).

• Bottleneck: the likely overabundance of interesting objects found; the number of likely candidates needs to be limited


Goal of the exercise

• Given measurements from N objects, over d features, together with their associated measurement errors

• Find the peculiar objects whose peculiarity is not due to measurement errors (‘genuine’, potentially interesting outliers),

• along with a model of the density of non-outliers.

• An outlier-robust mixture model for data with known measurement errors

• Solve by a structured-variational EM algorithm

• Find the outliers

– Controlled experiments & Real data

-- Sun, Kabán and Raychaudhury (astro-ph/0709.0928)


Robust density modelling

[Figure: scatter plot, both axes running from -15 to 15]

Peel & McLachlan: Robust mixture modeling using the t distribution. Statistics & Computing, 2000

It assumes that the data are points. Real scientific data, however, also contain error estimates.


Can knowledge of the errors help us infer the ‘genuine’ outliers?

[Figure: scatter plots, both axes running from -15 to 15]


Models with Latent (‘Hidden’) Variables

• In many applications there are 2 sets of variables:
– Variables whose values we can directly measure
– Variables that are “hidden”, cannot be measured

• Examples:
– Speech recognition
  • Observed: acoustic voice signal
  • Hidden: label of the word spoken
– Face tracking in images
  • Observed: pixel intensities
  • Hidden: position of the face in the image
– Text modelling
  • Observed: counts of words in a document
  • Hidden: topics that the document is about

Slide adapted from P Smyth: KDD tutorial


Some existing latent variable models

A wide class of latent variable models has the following form:

y = f(s, θ) + n

where y is the observed data, s the latent variables, θ the parameters and n the noise.

Specifications required!

1) p(s)

2) f(.)

• Linear models with Gaussian latent prior: FA, PPCA, PCA
• Linear models with discrete latent prior: FMM
• Linear models with non-Gaussian latent prior: IFA, ICA
• Linear model with latent prior over +ve domain & +ve parameters: NMF
• Non-linear models with uniform latent prior: GTM, LTM

[Graphical model: nodes s, y, θ]
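The two specifications, p(s) and f(.), can be made concrete with a minimal sketch of one member of this family: a linear model with a Gaussian latent prior, in the spirit of PPCA. The function name `sample_linear_latent`, the loading matrix `W` and the noise level are illustrative assumptions, not taken from the slides.

```python
import random

def sample_linear_latent(W, noise_std, rng):
    """Draw one observation y = W s + n, with s ~ N(0, I) and n ~ N(0, noise_std^2 I)."""
    q = len(W[0])                                   # latent dimension
    s = [rng.gauss(0.0, 1.0) for _ in range(q)]     # 1) prior p(s): standard Gaussian
    y = [
        sum(w * s_j for w, s_j in zip(row, s))      # 2) f(s, theta) = W s, a linear map
        + rng.gauss(0.0, noise_std)                 # additive noise n
        for row in W
    ]
    return s, y

rng = random.Random(0)
W = [[2.0], [1.0], [-1.0]]   # hypothetical 3x1 loading matrix (plays the role of theta)
samples = [sample_linear_latent(W, 0.1, rng) for _ in range(5)]
```

Swapping the prior (discrete, non-Gaussian, non-negative) or the map f gives the other model types in the list.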


What are all these acronyms?

• FA = Factor Analysis
• PCA = Principal Component Analysis
• PPCA = Probabilistic Principal Component Analysis
• FMM = Finite Mixture Models
• IFA = Independent Factor Analysis
• ICA = Independent Component Analysis (noise-free IFA)
• NMF = Non-negative Matrix Factorisation
• GTM = Generative Topographic Mapping
• LTM = Latent Trait Model

See astro-ph/0709.0928


The Student t distribution

The t distribution was introduced by William Gosset in 1908. He published under the pseudonym ‘Student’.

$$\mathrm{St}(t \mid k) = \frac{\Gamma\left((k+1)/2\right)}{\sqrt{k\pi}\,\Gamma(k/2)} \left(1 + \frac{t^2}{k}\right)^{-(k+1)/2}$$

“For many applied problems, the tails of the normal distribution are shorter than required”
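To see how much heavier the tails are, the following sketch evaluates the standard Student t density with k degrees of freedom next to the standard normal density. The function names are illustrative; only the Python standard library is used.

```python
import math

def student_t_pdf(t, k):
    """Student t density with k degrees of freedom (log-gamma avoids overflow for large k)."""
    log_norm = math.lgamma((k + 1) / 2) - math.lgamma(k / 2) - 0.5 * math.log(k * math.pi)
    return math.exp(log_norm - (k + 1) / 2 * math.log1p(t * t / k))

def normal_pdf(t):
    """Standard normal density."""
    return math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)

# Far from the centre the t density decays polynomially, the normal exponentially.
for t in (0.0, 2.0, 4.0, 6.0):
    print(t, student_t_pdf(t, 3), normal_pdf(t))
```

At t = 4 with k = 3 the t density is larger than the normal density by well over an order of magnitude, which is why a single distant point pulls a t-based fit much less than a Gaussian one.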


Determining the number of components

• We used a lower bound to the Minimum Message Length in conjunction with our likelihood bound. Other methods are possible.

• The MML criterion is to maximise:


Learning with Latent Variables

• Guess some initial parameters θ0

• E-step (Inference)
– For each case, and each unknown variable, compute p(S | known data, θ0)

• M-step (Optimization)
– Maximize L(θ) using p(S | ….. )
– This yields new parameter estimates θ1

• This is the EM algorithm:
– Guaranteed to converge to a (local) maximum of L(θ)

Dempster, Laird, Rubin, 1977
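The E-step/M-step loop can be sketched on the simplest latent variable model, a one-dimensional Gaussian mixture. This is plain EM for illustration only, not the structured-variational EM of the paper; the data, starting values and function names are made up.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, mu, var, pi, n_iter=50):
    """EM for a K-component 1-D Gaussian mixture; returns parameters and the log-likelihood trace."""
    K = len(mu)
    trace = []
    for _ in range(n_iter):
        # E-step: posterior responsibility p(z = k | x, theta) for every data point
        resp, loglik = [], 0.0
        for x in data:
            joint = [pi[k] * normal_pdf(x, mu[k], var[k]) for k in range(K)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([j / total for j in joint])
        trace.append(loglik)
        # M-step: re-estimate theta from the responsibility-weighted data
        for k in range(K):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            pi[k] = nk / len(data)
    return mu, var, pi, trace

rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(200)] + [rng.gauss(4.0, 1.5) for _ in range(200)]
mu, var, pi, trace = em_gmm_1d(data, mu=[-1.0, 1.0], var=[1.0, 1.0], pi=[0.5, 0.5])
```

The `trace` list records L(θ) at the start of each iteration and should never decrease, which is the convergence guarantee quoted above. The result still depends on the initial guess, since only a local maximum is reached.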


SDSS quasars: can we find ones at z>2?

• 10,000 quasars extracted from the SDSS (DR4) quasar catalogue

• Five optical filters (u, g, r, i, z). From these, we construct 4 features: u-r, g-r, i-r, r-z

• Spectroscopic redshifts used to validate the results.

Fan et al 2000
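A minimal sketch of how the four colour features and an error covariance for them could be built from the five magnitudes. The slides do not say how the colour errors were derived; the sketch assumes independent Gaussian errors on the magnitudes and propagates them linearly, and all numbers are made up.

```python
# Rows of A map the magnitudes (u, g, r, i, z) to the colours u-r, g-r, i-r, r-z.
A = [
    [1, 0, -1, 0, 0],    # u - r
    [0, 1, -1, 0, 0],    # g - r
    [0, 0, -1, 1, 0],    # i - r
    [0, 0, 1, 0, -1],    # r - z
]

def colours(mags):
    """Return the four colour features for one object."""
    return [sum(a * m for a, m in zip(row, mags)) for row in A]

def colour_error_covariance(mag_sigmas):
    """S = A diag(sigma^2) A^T, ASSUMING independent magnitude errors."""
    var = [s * s for s in mag_sigmas]
    return [
        [sum(ai * v * aj for ai, v, aj in zip(row_i, var, row_j)) for row_j in A]
        for row_i in A
    ]

mags = [19.8, 19.5, 19.3, 19.2, 19.1]       # made-up u, g, r, i, z magnitudes
sigmas = [0.05, 0.02, 0.02, 0.02, 0.04]     # made-up 1-sigma magnitude errors
features = colours(mags)
S = colour_error_covariance(sigmas)
```

Because every colour shares the r band, the off-diagonal entries of S are non-zero under this assumption. Keeping only the diagonal would be a further simplification.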


Bottom line: z>2.5 quasars as outliers

AUC (“outlierness”) vs. different redshift thresholds. A large fraction of quasars at redshift 2.5 or higher are detected with high probability

This is not possible with 2D projections


Conclusions

• Knowledge of errors is useful for making the hunt for the ‘interesting’ outliers more directed

• Keeping some of the dependencies in the variational posterior density can increase accuracy

• The method can efficiently find high redshift quasars as outliers from the SDSS photometric quasar catalogue

• We are extending these findings to combined statistical and visual analysis of data with outliers and error information.


Implications of the choice for prior p(s) – Types of models

• Dense & distributed (compression)
• Prototype-based (clustering)
• Sparse & distributed (compression, clustering, structure discovery)

Lee & Seung, Nature 401, 788 (1999)


Mixture of t-distributions (Robust mixture)

• Two sets of latent variables:

– A discrete class-variable z

– A Gamma variable u

• Maximum likelihood estimation via EM

• Inferred ‘peculiarity’ = E[u|tn], which is now obtained via marginalising over the class-variable.

[Graphical model: nodes tn, un, zn (plate over N); νk, μk, Σk (plate over K); π]

Peel & McLachlan, Statistics & Computing, 2000.
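The peculiarity score can be sketched for a one-dimensional t-mixture with fixed parameters. For a single component the standard result is E[u | t, z=k] = (ν_k + d) / (ν_k + δ²), with δ² the squared Mahalanobis distance and d the dimension; the mixture version averages this over the posterior class probabilities. The parameter values below are hypothetical.

```python
import math

def student_t_pdf(t, mu, var, nu):
    """1-D Student t density with location mu, squared scale var and nu degrees of freedom."""
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi * var))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p((t - mu) ** 2 / (nu * var)))

def expected_u(t, pis, mus, variances, nus):
    """E[u | t] for a 1-D mixture of Student t components, marginalised over the class z."""
    joint = [p * student_t_pdf(t, m, v, n) for p, m, v, n in zip(pis, mus, variances, nus)]
    total = sum(joint)
    value = 0.0
    for j, m, v, n in zip(joint, mus, variances, nus):
        delta2 = (t - m) ** 2 / v                        # squared Mahalanobis distance
        value += (j / total) * (n + 1) / (n + delta2)    # posterior weight times E[u | t, z=k], d = 1
    return value

pis, mus, variances, nus = [0.5, 0.5], [-5.0, 5.0], [1.0, 1.0], [3.0, 3.0]   # hypothetical
typical = expected_u(5.0, pis, mus, variances, nus)     # sits on a cluster centre
peculiar = expected_u(20.0, pis, mus, variances, nus)   # far from both clusters
```

Small values of E[u | t] flag points deep in the tails, so `peculiar` comes out far smaller than `typical`.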


A robust mixture for data with errors

• A heteroscedastic Gaussian error model (with unknown mean and known variance) will account for the measurement errors

• A mixture of Student-t densities will model the density of ‘w’:

$$p(w_n) = \sum_{k=1}^{K} \pi_k \, \mathrm{St}(w_n \mid \mu_k, \Sigma_k, \nu_k)$$

• Putting these together, the data likelihood is:

$$\mathcal{N}(t \mid \mu_k, S + \Sigma_k / u)$$
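One way to read the two ingredients together is as a generative recipe, sketched here in one dimension. It relies on the standard scale-mixture representation of the Student t (a Gaussian whose variance is divided by a Gamma variable u); the parameter values and the function name are illustrative assumptions.

```python
import math
import random

def sample_point(pis, mus, variances, nus, err_var, rng):
    """One draw (z, u, w, t) from a 1-D robust mixture with known error variance err_var."""
    z = rng.choices(range(len(pis)), weights=pis)[0]     # class variable z
    nu = nus[z]
    u = rng.gammavariate(nu / 2, 2 / nu)                 # shape nu/2, scale 2/nu, so E[u] = 1
    w = rng.gauss(mus[z], math.sqrt(variances[z] / u))   # clean value: N(mu_k, Sigma_k / u)
    t = rng.gauss(w, math.sqrt(err_var))                 # observed value: N(w, S)
    return z, u, w, t

rng = random.Random(0)
pis, mus, variances, nus = [0.3, 0.7], [-4.0, 3.0], [1.0, 0.5], [4.0, 4.0]   # hypothetical
draws = [sample_point(pis, mus, variances, nus, err_var=0.25, rng=rng) for _ in range(20000)]
mean_u = sum(d[1] for d in draws) / len(draws)
noise_var = sum((d[3] - d[2]) ** 2 for d in draws) / len(draws)
```

Integrating out the clean value w leaves t, given the class k and the scale u, Gaussian with mean μ_k and variance S + Σ_k/u, which matches the Gaussian term in the likelihood.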


• The joint likelihood of all variables:

[Graphical model: nodes tn, Sn, wn, un, zn (plate over N); νk, μk, Σk (plate over K); π]
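The equation itself did not survive in this transcript. A plausible form, reconstructed from the graphical model and the Gaussian-Gamma representation of the Student t and not copied from the slide, would be the following, where z_{nk} = 1 if object n belongs to component k and Ga denotes a Gamma density in shape-rate form:

```latex
p(t, w, u, z \mid \theta) = \prod_{n=1}^{N} \mathcal{N}(t_n \mid w_n, S_n)
  \prod_{k=1}^{K} \left[ \pi_k \, \mathcal{N}(w_n \mid \mu_k, \Sigma_k / u_n) \,
  \mathrm{Ga}(u_n \mid \nu_k/2, \nu_k/2) \right]^{z_{nk}}
```

See astro-ph/0709.0928 for the authors' own statement of the model.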


Experiments and Results

1) How accurate is the structured variational approximation employed?
– How does it compare to a fully factorial approximation?
– How does it compare to the ‘ground truth’ Markov-Chain-EM?

2) To what extent does knowledge of measurement errors help us to infer the outliers that are not due to these errors?
– How does it compare to ignoring the knowledge about the errors?
– How does it compare to the idealised case of having the clean data with no measurement errors?

3) Application in Astronomy: Finding peculiar objects of interest from the SDSS quasar catalogue (high-redshift quasars).


2) Accuracy of detecting ‘genuine’ outliers

• Synthetic data sets sampled from the model, starting from 3 well separated Gaussians and genuine outliers from a uniform distribution

• Five different error levels defined: the diagonals of S range over [0, 0.01], [0, 0.1], [0, 1], [0, 10] and [0, 100] respectively.

• Computed the Area Under the ROC curve (AUC) achieved by the inferred outlierness against the true outliers, averaged over 10 independent repeats.
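The AUC used as the figure of merit can be sketched with its rank interpretation: the probability that a randomly chosen true outlier receives a higher outlierness score than a randomly chosen non-outlier. The scores and labels below are made up, and the convention that larger means more outlying is an assumption of this sketch.

```python
def auc(scores, is_outlier):
    """Area under the ROC curve via pairwise comparisons; ties count one half."""
    pos = [s for s, o in zip(scores, is_outlier) if o]
    neg = [s for s, o in zip(scores, is_outlier) if not o]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

outlierness = [0.9, 0.4, 0.3, 0.2, 0.1, 0.7]       # made-up scores, larger = more outlying
truth = [True, True, False, False, False, False]   # made-up ground-truth outlier labels
score = auc(outlierness, truth)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of outliers from the rest. The pairwise loop is quadratic in the number of objects; a sort-based version would be preferable for large catalogues.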


Accuracy of detecting “genuine” outliers: Results on the synthetic data sets

[Figure panels: In sample; Out of sample]


• Comparison with a Maximum A Posteriori (MAP) estimator (for un) in terms of outlier finding and cluster finding, computed from 10 independent realizations of the synthetic data

– MAP is more variable and less accurate than var-EM

                     MAP             Var-EM           P-value
AUC                  0.915 ± 0.13    0.964 ± 0.015    0.0890
Clust accuracy (%)   0.920 ± 0.15    0.931 ± 0.014    0.0079

Some early results

• Synthetic data in 10-D: 3 clusters & outliers

• It is beneficial to represent the outlierness in a 3rd dimension.
