Post on 29-Mar-2019

Scalable Bayesian Network Classifiers
Geoff Webb, Ana Martinez, Nayyar Zaidi
Introduction

• Learning from large data is not just about scaling up existing algorithms.
• Large data is best tackled by fundamentally new learning algorithms.
Learning curves

[Figure: learning curve, RMSE vs. training set size (0 to 1,000,000)]

• Error typically reduces as data quantity increases.
• Different algorithms have different curves.
Learning curves

[Figure: learning curve, RMSE vs. training set size; "Overfitting on small data" marked at the small-data end and "Best on large data" at the large-data end]

• Algorithms that closely fit complex multivariate distributions will tend to overfit small data, but can better fit large data: the bias-variance tradeoff.
Capacity to fit = model space + optimization

[Figure: learning curves, RMSE vs. training set size, for Naïve Bayes and Logistic Regression]
Much research of questionable relevance

[Figure: learning curve, RMSE vs. training set size (0 to 1,000,000)]

• Most prior research has used relatively small data.
Requirements

[Figure: learning curve, RMSE vs. training set size (0 to 1,000,000)]

• Need algorithms that can closely fit complex multivariate distributions while being very computationally efficient:
  • low bias
  • few passes through the data
  • out of core
State-of-the-art

• Most low-bias algorithms do not scale and are inherently in-core:
  • Random Forest
  • SVM
  • Deep Learning
• Selective KDB is a scalable, low-bias Bayesian Network Classifier.
Outline
1 Introduction
2 Bayesian Network Classifiers
3 Selective KDB
4 Experiments
5 Conclusions & Future Research
Bayesian Network Classifiers

• Defined by a parent relation π and Conditional Probability Tables (CPTs):
  • π encodes conditional independence / structure
  • CPTs encode conditional probabilities
• Classifies using P(y | x) ∝ P(y | πY) ∏i P(xi | πi)
• Usually makes Y a parent of all Xi

[Network diagram: Y with children X1, X2, X3, X4, X5]

• Given π, CPTs can be learned by counting joint frequencies:
  • single pass
  • incremental
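The single-pass, incremental counting step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the simplest structure (Y is the sole parent of every Xi, i.e. naive Bayes), assumes roughly two values per attribute in the smoothing denominator, and uses hypothetical data.

```python
from collections import Counter

class SimpleBNC:
    """Minimal Bayesian network classifier where Y is the sole parent of
    every attribute (naive Bayes structure). The CPTs are just joint
    frequency counts, learned in a single incremental pass."""

    def __init__(self, n_atts, alpha=1.0):
        self.n_atts = n_atts
        self.alpha = alpha                    # Laplace smoothing
        self.class_counts = Counter()
        self.joint = [Counter() for _ in range(n_atts)]   # (y, x_i) counts

    def update(self, x, y):
        # Incremental: one example at a time, one pass over the data.
        self.class_counts[y] += 1
        for i, v in enumerate(x):
            self.joint[i][(y, v)] += 1

    def predict(self, x):
        n = sum(self.class_counts.values())
        best, best_p = None, -1.0
        for y, cy in self.class_counts.items():
            p = cy / n                        # P(y)
            for i, v in enumerate(x):
                # P(x_i | y), smoothed assuming 2 values per attribute
                p *= (self.joint[i][(y, v)] + self.alpha) / (cy + 2 * self.alpha)
            if p > best_p:
                best, best_p = y, p
        return best
```

With a few hypothetical (x, y) pairs fed to `update`, `predict` returns the class maximizing the smoothed product above.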
k-Dependence Bayes (KDB)

• Two-pass learning
• 1st pass, learn structure:
  • Collect counts for pairs of attributes with the class.
  • Order attributes by mutual information (MI) with the class.
  • Select parents based on conditional mutual information (CMI):
    • no more than k parents
    • parents must be earlier in the order

[Network diagram: Y and X1, X2, X3, X4, X5]

• 2nd pass, learn CPTs:
  • Collect statistics according to the structure learned.
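The first pass can be sketched as below. This is a small in-memory sketch under simplifying assumptions: discrete attributes, MI and CMI computed directly from counts, and illustrative function names (`kdb_structure` and the helpers are not from the paper); the real algorithm collects the same counts out of core.

```python
import math
from collections import Counter

def mutual_info(pair_counts, n):
    """MI(X;Y) from a Counter over (x, y) value pairs."""
    px, py = Counter(), Counter()
    for (x, y), c in pair_counts.items():
        px[x] += c; py[y] += c
    return sum((c / n) * math.log((c * n) / (px[x] * py[y]))
               for (x, y), c in pair_counts.items())

def cond_mutual_info(triple_counts, n):
    """CMI(Xi;Xj | Y) from a Counter over (xi, xj, y) triples."""
    cy, ciy, cjy = Counter(), Counter(), Counter()
    for (xi, xj, y), c in triple_counts.items():
        cy[y] += c; ciy[(xi, y)] += c; cjy[(xj, y)] += c
    return sum((c / n) * math.log((c * cy[y]) / (ciy[(xi, y)] * cjy[(xj, y)]))
               for (xi, xj, y), c in triple_counts.items())

def kdb_structure(data, labels, k):
    """KDB first pass: order attributes by MI with the class, then give
    each attribute up to k parents (earlier in the order) with highest CMI."""
    n, a = len(data), len(data[0])
    pair = [Counter() for _ in range(a)]
    triple = {}
    for row, y in zip(data, labels):          # single pass to collect counts
        for i in range(a):
            pair[i][(row[i], y)] += 1
            for j in range(i):
                triple.setdefault((i, j), Counter())[(row[i], row[j], y)] += 1
    order = sorted(range(a), key=lambda i: -mutual_info(pair[i], n))
    parents = {order[0]: []}
    for pos, i in enumerate(order[1:], 1):
        earlier = order[:pos]
        ranked = sorted(earlier, key=lambda j: -cond_mutual_info(
            triple[(max(i, j), min(i, j))], n))
        parents[i] = ranked[:k]               # at most k parents
    return order, parents
```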
k-Dependence Bayes (KDB)

• Training Space: O(ya^2v^2 + yav^(k+1))
• Classification Space: O(yav^(k+1))
• Training Time: O(ta^2 + ya^2v^2 + tak)
• Classification Time: O(yak)

t = number of training examples; a = number of attributes; v = average number of values; y = number of classes.
k-Dependence Bayes (KDB)

• Parameter k controls the bias-variance tradeoff.

[Figure: learning curves, RMSE vs. training set size (0 to 1,000,000), for Naïve Bayes and KDB with k = 1 to 5]

• No a priori means to anticipate the best k.
• Spurious attributes may increase error.
KDB models are nested

• Mi,j = KDB with k = i using attributes 1 to j.

        Attributes (ordered by MI)
  k      1      2      3      4
  1    M1,1   M1,2   M1,3   M1,4
  2    M2,1   M2,2   M2,3   M2,4
  3    M3,1   M3,2   M3,3   M3,4
  4    M4,1   M4,2   M4,3   M4,4

• Mi,j is a minor extension of Mi,j−1 and of Mi−1,j.
Selective KDB

• In one extra pass, select an attribute subset and the best k using leave-one-out cross-validation.
• The full model subsumes all k × a submodels.
• Very efficient selection among a large class of strong models.
Leave-one-out CV

• Very low-bias estimator of out-of-sample performance.
• Incremental cross-validation makes it VERY efficient for BNCs:
  • Collect counts from all data once.
  • When classifying a hold-out object, subtract it from the counts.
• All k × a nested models can be evaluated with little more computation than the full KDB model.
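The count-subtraction trick can be sketched as follows. For brevity this sketch uses a naive Bayes structure rather than the full KDB model, and the function name is illustrative; the point is that counts are collected once and each hold-out example is subtracted, classified, and restored.

```python
from collections import Counter

def incremental_loocv(data, labels, alpha=1.0):
    """Leave-one-out CV for a count-based classifier. Counts are collected
    in one pass; each hold-out example is subtracted from the counts,
    classified, then added back, so LOOCV costs little more than training."""
    a = len(data[0])
    class_counts = Counter()
    joint = [Counter() for _ in range(a)]
    for x, y in zip(data, labels):            # one pass to collect counts
        class_counts[y] += 1
        for i, v in enumerate(x):
            joint[i][(y, v)] += 1

    def classify(x):
        n = sum(class_counts.values())
        scores = {}
        for y, cy in class_counts.items():
            p = cy / n
            for i, v in enumerate(x):
                p *= (joint[i][(y, v)] + alpha) / (cy + 2 * alpha)
            scores[y] = p
        return max(scores, key=scores.get)

    correct = 0
    for x, y in zip(data, labels):
        class_counts[y] -= 1                  # subtract the hold-out example
        for i, v in enumerate(x):
            joint[i][(y, v)] -= 1
        correct += classify(x) == y
        class_counts[y] += 1                  # restore the counts
        for i, v in enumerate(x):
            joint[i][(y, v)] += 1
    return correct / len(data)
```

Because subtraction and restoration touch only the held-out example's own counts, the same pass can score many nested models at once.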
Selective KDB

• Training Space: O(ya^2v^2 + yav^(k+1))
• Classification Space: O(ya*v^(k*+1))
• Training Time: O(ta^2 + ya^2v^2 + tayk)
• Classification Time: O(ya*k*)

t = number of training examples; a = number of attributes; v = average number of values; y = number of classes; a* = number of attributes selected (a* ≤ a); k* = best value of k found (k* ≤ kmax).
Selective KDB

[Figure: learning curves, RMSE vs. training set size (0 to 1,000,000), for Naïve Bayes, KDB k = 1 to 5, SKDB selecting only k, SKDB (k = 5) selecting only attributes, and full SKDB]
16 datasets (165K-54M examples)

  Name                Cases (million)   Atts.   Classes   Size
  localization             0.165           5       11      11MB
  census-income            0.299          41        2     136MB
  USPSExtended             0.342         676        2     603MB*
  MITFaceSetA              0.474         361        2     584MB
  MITFaceSetB              0.489         361        2     603MB
  MSDYearPrediction        0.515          90       90     601MB*
  covtype                  0.581          54        7      72MB
  MITFaceSetC              0.839         361        2     1.1GB
  poker-hand               1.025          10       10      24MB
  uscensus1990             2.458          67        4     325MB
  PAMAP2                   3.851          54       19     1.7GB
  kddcup                   5.210          41       40     754MB
  linkage                  5.749          11        2     251MB
  mnist8m                  8.100         784       10      19GB*
  satellite                8.705         138       24     3.6GB
  splice                  54.628         141        2     7.3GB*

  * sparse format.
Numeric attributes

• 5-bin equal-frequency discretisation
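Equal-frequency discretisation assigns (roughly) the same number of training values to each bin. A minimal sketch, with illustrative function names and simplified handling of tied values:

```python
def equal_frequency_bins(values, n_bins=5):
    """Cut points for equal-frequency discretisation: each bin receives
    (roughly) the same number of training values."""
    s = sorted(values)
    n = len(s)
    # one interior boundary after every n/n_bins-th sorted value
    return [s[(n * b) // n_bins] for b in range(1, n_bins)]

def discretise(value, cuts):
    """Map a numeric value to its bin index given the cut points."""
    for b, c in enumerate(cuts):
        if value < c:
            return b
    return len(cuts)
```

For example, discretising 100 distinct values into 5 bins yields 4 cut points and 5 bins of 20 values each.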
Out-of-core BNCs

[Figure: RMSE scatter plots comparing KDB k=5 vs. SKDB and the best KDB vs. SKDB]
Out-of-core BNCs

• Training Time Comparisons

[Figure: training time in seconds (log scale) for NB, TAN, AODE, KDB k=5, and SKDB across the datasets]
Out-of-core BNCs

• Classification Time Comparisons

[Figure: classification time in seconds (log scale) for NB, TAN, AODE, KDB k=5, and SKDB across the datasets, with the selected k annotated for SKDB]
Out-of-core SGD

• SGD in Vowpal Wabbit (VW):
  • Squared and logistic loss functions.
  • Quadratic features (best results).
  • Different numbers of passes (3 or 10).
  • Discrete attributes converted to binary features.
  • One-against-all for multiclass classification.
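The "discrete attributes converted to binary features" step can be sketched as a conversion to VW's text input format, where each attribute=value token is a distinct weight-1 indicator feature. The attribute naming scheme below is an illustrative assumption, not the authors' preprocessing:

```python
def to_vw_line(label, attributes):
    """Convert one instance with discrete attributes into Vowpal Wabbit's
    text format. Each att{i}={value} token becomes a distinct binary
    (weight-1) feature; `label` is a 1-based class index as used with
    one-against-all multiclass training."""
    feats = " ".join(f"att{i}={v}" for i, v in enumerate(attributes))
    return f"{label} | {feats}"
```

For instance, a two-attribute instance of class 2 becomes the line `2 | att0=red att1=tall`.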
Out-of-core SGD

[Figure: RMSE scatter plots (log scale) of SKDB (max k = 5) vs. VW with logistic loss (VWLF) and with squared loss (VWSF)]

Win/draw/loss of Selective KDB:

                   vs. VW (logistic), RMSE   vs. VW (squared), 0-1 Loss
  Selective KDB          (7+2)-0-7                   8-1-7
Out-of-core SGD

• Training Time Comparisons

[Figure: training time in seconds (log scale) for VWSF, VWLF, and SKDB across the datasets]
Out-of-core SGD

• Classification Time Comparisons

[Figure: classification time in seconds (log scale) for VWSF, VWLF, and SKDB across the datasets]
In-core BayesNet and RF

[Figure: RMSE scatter plots of SKDB vs. BayesNet and RF; x = sampled dataset]

Win/draw/loss of Selective KDB:

                   BayesNet   RF (Num)
  Selective KDB     6-3-3      5-0-6
In-core BayesNet and RF

• Training Time Comparisons

[Figure: training time in seconds (log scale) for BayesNet, RF, and SKDB across the datasets]
In-core BayesNet and RF

• Classification Time Comparisons

[Figure: classification time in seconds (log scale) for BayesNet, RF, and SKDB across the datasets]
Global comparisons

• Cumulative Ranking
• Selective KDB copes well with high-dimensional datasets and with datasets of more than 1 million points.
• RF performs better on datasets with a small number of attributes.
• VW has an advantage for sparse numeric data.
Global comparisons

• Ranking

[Figure: mean rank per learner: NB 7.63, AODE 6.81, TAN 5.88, BayesNet 3.47, VWLF 3.44, KDB (best k) 3.34, RF (Num) 2.91, SKDB 2.53; lower is better]
Global comparisons (Time)

[Figure: box plots of training time and test time in seconds (log scale) for NB, TAN, AODE, KDB k, sKDB, VWsq, and VWlog]
Stepping back

• The key trick is nested evaluation of a large class of count-based models.
• It also works with two-pass selective evaluation of AnDE models.
• It combines the efficiency of simple generative techniques with the power of discriminative learning.
• It can use any loss function that is a function of each P̂(y | x).
Future Research

• Numeric attributes:
  • It seems that there should be something better than discretisation.
  • This remains elusive ...
• Overfitting avoidance.
• Increase the number of alternative models considered.
• Explore other forms of nested BNCs.
• Two-pass Selective KDB:
  • Sample a small test set in the second pass.
• Single-pass generative/discriminative learning:
  • Initially learn a simple generative model.
  • Collect discriminative statistics and refine the model when there is sufficient evidence to make a choice.
  • Refinements might be attribute selection, structural refinement, or weighting.
  • Repeat.
Satellite learning curves

[Figure: learning curves, RMSE vs. training set size, for NB, KDB K = 1 to 8, SKDB K = 10, and Random Forest]
Conclusions

• Large data calls for fundamentally new learning algorithms.
• We are pioneering a new generation of theoretically well-founded algorithms that are both scalable to very large data and capable of exploiting the fine detail inherent therein.