Chapter 6 The Structural Risk Minimization Principle
Junping Zhang [email protected]
Intelligent Information Processing Laboratory, Fudan University
March 23, 2004
Objectives
Structural risk minimization
Two other induction principles
The Scheme of the SRM induction principle
Real-Valued functions
Principle of SRM
SRM
Minimum Description Length and SRM inductive principles
The idea about the Nature of Random Phenomena
Minimum Description Length Principle for the Pattern Recognition Problem
Bounds for the MDL
SRM for the simplest Model and MDL
The Shortcoming of the MDL
The idea about the Nature of Random Phenomena
Probability theory (1930s, Kolmogorov)
Formal inference
The axiomatization does not consider the nature of randomness
Axioms: given probability measures
The idea about the Nature of Random Phenomena
The model of randomness: Solomonoff (1965), Kolmogorov (1965), Chaitin (1966).
Algorithmic (descriptive) complexity:
The length of the shortest binary computer program that describes the object.
Up to an additive constant, it does not depend on the type of computer.
It is a universal characteristic of the object.
A relatively long string describing an object is random if the algorithmic complexity of the object is high, that is, if the given description of the object cannot be compressed significantly.
MML (Wallace and Boulton, 1968) and MDL (Rissanen, 1978): algorithmic complexity as a main tool of inductive inference for learning machines.
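A minimal sketch of the "cannot be compressed significantly" idea. Algorithmic complexity itself is not computable, so the sketch substitutes the length of a zlib-compressed string as a crude upper-bound proxy. The helper name compression_ratio and the two test strings are illustrative assumptions, not part of the lecture.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed length divided by original length (zlib, level 9).

    This is only a stand-in for algorithmic complexity: a general-purpose
    compressor gives an upper bound on the shortest description, nothing more.
    """
    return len(zlib.compress(data, 9)) / len(data)


regular = b"01" * 5000             # highly regular string of 10,000 bytes
random_like = os.urandom(10000)    # 10,000 bytes from the OS entropy source

print("regular string:    ", compression_ratio(regular))
print("random-like string:", compression_ratio(random_like))
```

The regular string shrinks to a small fraction of its length, while the random-like string does not shrink at all; in the terminology above, only the second one behaves like a random object.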
Minimum Description Length Principle for the Pattern Recognition Problem
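Given l training pairs (ω_1, x_1), …, (ω_l, x_l), each containing a vector x and a binary value ω.
Consider two strings: the binary string ω_1, …, ω_l (146) and the string of vectors x_1, …, x_l (147).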
Question
Q: Given (147), is the string (146) a random object?
A: Analyze the complexity of the string (146) in the spirit of the Solomonoff-Kolmogorov-Chaitin ideas.
Compress its description.
Since the ω_i, i = 1, …, l, are binary values, the string (146) is described by l bits.
Since the training pairs were drawn randomly and independently, the value ω_i may depend on the vector x_i but not on the vector x_j, j ≠ i.
Model
General case: the code book does not contain the perfect table.
Randomness
Bounds for the MDL
Q: Does the compression coefficient K(T) determine the probability of the test error in classifying (decoding) vectors x by the table T?
A: Yes
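A rough sketch of how a compression coefficient could be computed for a single code book. Everything concrete here is an assumption, since the slide formulas did not survive: the description length is taken as the bits needed to name one of num_tables tables plus the bits needed to mark which of the l labels the table gets wrong (small overhead terms are ignored), and the bound is assumed to have the form R(T) < 2(K(T) ln 2 - (ln η)/l), holding with probability at least 1 - η. Check both against the lecture's own equations before relying on them.

```python
import math


def description_bits(num_tables: int, l: int, errors: int) -> int:
    """Bits to name a table plus bits to locate its errors (assumed coding scheme)."""
    table_bits = math.ceil(math.log2(num_tables))
    correction_bits = math.ceil(math.log2(math.comb(l, errors)))
    return table_bits + correction_bits


def compression_coefficient(num_tables: int, l: int, errors: int) -> float:
    """K(T): description length of the label string divided by its raw length l."""
    return description_bits(num_tables, l, errors) / l


def error_bound(k: float, l: int, eta: float) -> float:
    """Assumed bound on the test error, valid with probability at least 1 - eta."""
    return 2.0 * (k * math.log(2) - math.log(eta) / l)


k_perfect = compression_coefficient(num_tables=1024, l=1000, errors=0)
k_noisy = compression_coefficient(num_tables=1024, l=1000, errors=20)
print(k_perfect, error_bound(k_perfect, 1000, 0.05))
print(k_noisy, error_bound(k_noisy, 1000, 0.05))
```

A table that needs corrections costs more bits, so its compression coefficient and the resulting bound are both larger.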
Comparison between the MDL and ERM in the simplest model
SRM for the simplest Model and MDL
The power of compression coefficient
To obtain a bound on the probability of error, only information about the compression coefficient needs to be known.
The power of compression coefficient
Details that do not need to be known:
How many examples were used.
How the structure of the code books was organized.
Which code book was used and how many tables were in this code book.
How many errors were made by the table from the code book that was used.
MDL principle
To minimize the probability of error, one has to minimize the coefficient of compression.
The shortcoming of the MDL
MDL uses code books with a finite number of tables.
When the set of functions depends continuously on parameters, one first has to quantize that set to make the tables.
Quantization
How do we make a ‘smart’ quantization for a given number of observations?
For a given set of functions, how can we construct a code book with a small number of tables but with good approximation ability?
The shortcoming of the MDL
Finding a good quantization is extremely difficult, and this is the main shortcoming of the MDL principle.
The MDL principle works well when the problem of constructing reasonable code books has a good solution.
Consistency of the SRM principle and asymptotic bounds on the rate of convergence
Q: Is the SRM principle consistent? What is the bound on the (asymptotic) rate of convergence?
Consistency of the SRM principle.
Simplified version
Remark
To avoid choosing the minimum of functional (156) over the infinite number of elements of the structure, add a constraint:
Choose the minimum from the first l elements of the structure, where l is equal to the number of observations.
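A toy sketch of this selection rule. Functional (156) is not reproduced in the transcript, so the sketch swaps in a simpler setting that is entirely an assumption: element S_k of the structure is the finite set of 0/1 rules that are constant on each of 2^k equal bins of [0, 1), and the guaranteed risk is the empirical risk plus the Hoeffding/union-bound term sqrt((ln N_k - ln η) / (2l)) with N_k = 2^(2^k). The names fit_histogram_rule and srm_select are hypothetical.

```python
import math


def fit_histogram_rule(xs, ys, k):
    """ERM inside S_k: majority label in each of the 2**k equal bins of [0, 1)."""
    bins = 2 ** k
    ones = [0] * bins
    total = [0] * bins
    for x, y in zip(xs, ys):
        b = min(int(x * bins), bins - 1)
        ones[b] += y
        total[b] += 1
    # The majority label errs on the minority points of each bin.
    errors = sum(min(ones[b], total[b] - ones[b]) for b in range(bins))
    return errors / len(xs)


def srm_select(xs, ys, eta=0.05):
    """Return (k, empirical risk, bound) minimizing the bound over the first l elements."""
    l = len(xs)
    best = None
    for k in range(l):  # only the first l elements of the structure
        # ln N_k = 2**k * ln 2 for the assumed finite set S_k
        penalty = math.sqrt((2 ** k * math.log(2) - math.log(eta)) / (2 * l))
        if best is not None and penalty >= best[2]:
            break  # the penalty only grows with k, so later elements cannot win
        emp = fit_histogram_rule(xs, ys, k)
        if best is None or emp + penalty < best[2]:
            best = (k, emp, emp + penalty)
    return best


# Noise-free threshold at 0.5 on an even grid of 200 points.
xs = [(i + 0.5) / 200 for i in range(200)]
ys = [1 if x >= 0.5 else 0 for x in xs]
k, emp, bound = srm_select(xs, ys)
print(k, emp, bound)
```

On this data the rule with two bins already has zero empirical risk, and finer partitions only increase the capacity term, so the search stops there. A careful treatment would also split η across the elements of the structure; the sketch omits that.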
Discussions and Example
The rate of convergence is determined by two contradictory requirements on the rule n=n(l).
The first summand: the larger n=n(l), the smaller the deviation.
The second summand: the larger n=n(l), the larger the deviation.
For structures with a known bound on the rate of approximation, select the rule that assures the largest rate of convergence.
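The trade-off between the two summands can be sketched numerically. The concrete forms below are assumptions chosen only for illustration: the first summand is taken as n^(-s) (a known rate of approximation), and the second as sqrt(n ln l / l) (a capacity term with VC dimension h_n = n and constant factor 1). The function names are hypothetical.

```python
import math


def two_term_rate(n: int, l: int, s: float = 1.0) -> float:
    """Sum of an assumed approximation term and an assumed estimation term."""
    approximation = n ** (-s)                     # shrinks as n grows
    estimation = math.sqrt(n * math.log(l) / l)   # grows as n grows
    return approximation + estimation


def best_n(l: int, s: float = 1.0) -> int:
    """Element index n(l) that minimizes the two-term expression for sample size l."""
    return min(range(1, l + 1), key=lambda n: two_term_rate(n, l, s))


for l in (100, 1000, 10000):
    n = best_n(l)
    print(l, n, two_term_rate(n, l))
```

Under these assumptions the best n grows with l, but much more slowly than l itself, which is the kind of rule n=n(l) the slide asks for.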
Bounds for the regression estimation problem
The model of regression estimation by series expansion
Example
The problem of approximating functions
To get a high asymptotic rate of approximation, the only constraint is that the kernel should be a bounded function that can be described as a family of functions possessing finite VC dimension.
Problem of local risk minimization
Local Risk Minimization Model
Note
Using local risk minimization methods, one probably does not need rich sets of approximating functions; the classical semi-local methods, for example, are based on using a set of constant functions.
Note
For local estimation of functions in the one-dimensional case, it is probably enough to consider elements S_k, k=0,1,2,3, containing the polynomials of degree 0, 1, 2, 3.
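A small sketch of local estimation with the two lowest-degree elements, S_0 (constants) and S_1 (linear functions). The hard-threshold neighborhood |x - x0| <= beta, the function name local_fit, and the sample data are assumptions made for illustration; degrees 2 and 3 are left out to keep the example short.

```python
def local_fit(xs, ys, x0, beta, degree):
    """Least-squares polynomial of degree 0 or 1 fitted to the points with
    |x - x0| <= beta, evaluated at x0."""
    pts = [(x - x0, y) for x, y in zip(xs, ys) if abs(x - x0) <= beta]
    if not pts:
        raise ValueError("no observations in the neighborhood of x0")
    m = len(pts)
    mean_y = sum(y for _, y in pts) / m
    if degree == 0:
        return mean_y                      # local constant: plain average
    mean_u = sum(u for u, _ in pts) / m
    var_u = sum((u - mean_u) ** 2 for u, _ in pts)
    if var_u == 0.0:
        return mean_y                      # a single distinct x: fall back to the average
    slope = sum((u - mean_u) * (y - mean_y) for u, y in pts) / var_u
    return mean_y - slope * mean_u         # fitted line evaluated at u = 0, i.e. x = x0


# Noise-free line y = 2x + 1 sampled on [0, 1]; estimate at the boundary x0 = 0.
xs = [i / 10 for i in range(11)]
ys = [2 * x + 1 for x in xs]
print(local_fit(xs, ys, x0=0.0, beta=0.35, degree=0))
print(local_fit(xs, ys, x0=0.0, beta=0.35, degree=1))
```

At the boundary the local constant averages over points that all lie to one side of x0 and is biased upward, while the local linear fit recovers the true value; low-degree polynomials already remove much of that bias.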
Summary
MDL
SRM
Local Risk Functional