Journal of Machine Learning Research 2 (2002) 499-526. Submitted 7/01; Published 3/02.
Stability and Generalization
Olivier Bousquet  bousquet@cmapx.polytechnique.fr
CMAP, Ecole Polytechnique
F-91128 Palaiseau, FRANCE
André Elisseeff  andre.elisseeff@biowulf.com
BIOwulf Technologies
305 Broadway
New York, NY 10007
Editor: Dana Ron
Abstract
We define notions of stability for learning algorithms and show how to use these notions to derive generalization error bounds based on the empirical error and the leave-one-out error. The methods we use can be applied in the regression framework as well as in the classification one, when the classifier is obtained by thresholding a real-valued function. We study the stability properties of large classes of learning algorithms, such as regularization-based algorithms. In particular, we focus on Hilbert space regularization and Kullback-Leibler regularization. We demonstrate how to apply the results to SVM for regression and classification.
1. Introduction
A key issue in the design of efficient Machine Learning systems is the estimation of the accuracy of learning algorithms. Among the several approaches that have been proposed to this problem, one of the most prominent is based on the theory of uniform convergence of empirical quantities to their mean (see e.g. Vapnik, 1982). This theory provides ways to estimate the risk (or generalization error) of a learning system based on an empirical measurement of its accuracy and a measure of its complexity, such as the Vapnik-Chervonenkis (VC) dimension or the fat-shattering dimension (see e.g. Alon et al., 1997).
We explore here a different approach, which is based on sensitivity analysis. Sensitivity analysis aims at determining how much the variation of the input can influence the output of a system.1 It has been applied to many areas such as statistics and mathematical programming. In the latter domain, it is often referred to as perturbation analysis (see Bonnans and Shapiro, 1996, for a survey). The motivation for such an analysis is to design robust systems that will not be affected by noise corrupting the inputs.
In this paper, the objects of interest are learning algorithms. They take as input a learning set made of instance-label pairs and output a function that maps instances to the corresponding labels. The sensitivity in that case is thus related to changes of the outcome of the algorithm when the learning set is changed. There are two sources of randomness such algorithms have to cope with: the first one comes from the sampling mechanism used to generate the learning set, and the second one is due to noise in the measurements (on the instance and/or label). In contrast to standard approaches to sensitivity analysis, we mainly focus on the sampling randomness, and we thus are interested in how changes in the composition of the learning set influence the function produced by the algorithm. The outcome of such an approach is a principled way of getting bounds on the difference between
1. For a qualitative discussion about sensitivity analysis with links to other resources see e.g.
http://sensitivity-analysis.jrc.cec.eu.int/
© 2002 Olivier Bousquet and André Elisseeff.

empirical and generalization error. These bounds are obtained using powerful statistical tools known as concentration inequalities. The latter are the mathematical device corresponding to the following statement, from Talagrand (1996):

A random variable that depends (in a "smooth way") on the influence of many independent variables (but not too much on any of them) is essentially constant.

The expression "essentially constant" actually means that the random variable will have, with high probability, a value close to its expected value. We will apply these inequalities to the random variable we are interested in, that is, the difference between an empirical measure of error and the true generalization error. We will see that this variable has either a zero expectation or a nice property: the condition under which it concentrates around its expectation implies that its expectation is close to zero. That means that if we impose conditions on the learning system such that the difference between the empirical error and the true generalization error is roughly constant, then this constant is zero. This observation and the existence of concentration inequalities will allow us to state exponential bounds on the generalization error of a stable learning system.
The outline of the paper is as follows: after reviewing previous work in the area of stability analysis of learning algorithms, we introduce three notions of stability (Section 3) and derive bounds on the generalization error of stable learning systems (Section 4). In Section 5, we show that many existing algorithms such as SVM for classification and regression, ridge regression, or variants of maximum relative entropy discrimination do satisfy the stability requirements. For each of these algorithms, it is then possible to derive original bounds which have many attractive properties.
Previous work

It has long been known that, when trying to estimate an unknown function from data, one needs to find a tradeoff between bias and variance.2 Indeed, on one hand, it is natural to use the largest model in order to be able to approximate any function, while, on the other hand, if the model is too large, then the estimation of the best function in the model will be harder given a restricted amount of data. Several ideas have been proposed to fight against this phenomenon. One of them is to perform estimation in several models of increasing size and then to choose the best estimator based on a complexity penalty (e.g. Structural Risk Minimization). This allows to control the complexity while allowing to use a large model. This technique is somewhat related to the regularization procedures that we will study in greater detail in subsequent sections. Another idea is to use statistical procedures to reduce the variance without altering the bias. One such technique is the bagging approach of Breiman (1996a), which consists in averaging several estimators built from random subsamples of the data.
Although it is generally accepted that having a low variance (or a high stability, in our terminology) is a desirable property for a learning algorithm, there are few quantitative results relating the generalization error to the stability of the algorithm with respect to changes in the training set. The first such results were obtained by Devroye, Rogers and Wagner in the seventies (see Rogers and Wagner, 1978, Devroye and Wagner, 1979a,b). Rogers and Wagner (1978) first showed that the variance of the leave-one-out error can be upper bounded by what Kearns and Ron (1999) later called hypothesis stability. This quantity measures how much the function learned by the algorithm will change when one point in the training set is removed. The main distinctive feature of their approach is that, unlike VC-theory based approaches, where the only property of the algorithm that matters is the size of the space to be searched, it focuses on how the algorithm searches the space. This explains why it has been successfully applied to the k-Nearest Neighbors algorithm (k-NN), whose search space is known to have an infinite VC-dimension. Indeed, results from VC-theory

2. We deliberately do not provide a precise definition of bias and variance and resort to common intuition about these notions. In broad terms, the bias is the best error that can be achieved and the variance is the difference between the typical error and the best error.
would not be of any help in that case, since they are meaningful when the learning algorithm performs minimization of the empirical error in the full function space. However, the k-NN algorithm is very stable because of its 'locality'. This allowed Rogers and Wagner to get an upper bound on the difference between the leave-one-out error and the generalization error of such a classifier. These results were later extended to obtain bounds on the generalization error of k-local rules in Devroye and Wagner (1979a), and of potential rules in Devroye and Wagner (1979b).
In the early nineties, concentration inequalities became popular in the probabilistic analysis of algorithms, due to the work of McDiarmid (1989), and started to be used as tools to derive generalization bounds for learning algorithms by Devroye (1991). Building on this technique, Lugosi and Pawlak (1994) obtained new bounds for the k-NN, kernel rules and histogram rules. These bounds used "smoothed estimates" of the error, which estimate the posterior probability of error instead of simply counting the errors. This smoothing is very much related to the use of real-valued classifiers, and we will see that it is at the heart of the applicability of stability analysis to classification algorithms. A comprehensive account of the application of McDiarmid's inequality to obtain bounds for the leave-one-out error or the smoothed error of local classifiers can be found in Devroye et al. (1996).
Independently from this theoretical analysis, practical methods have been developed to deal with instability of learning algorithms. In particular, Breiman (1996a,b) introduced the Bagging technique, which is presented as a method to combine single classifiers in such a way that the variance of the overall combination is decreased. However, there is no theoretical guarantee that this variance reduction will bring an improvement on the generalization error.
Finally, a more recent work has shown an interesting connection between stability and VC-theory. Kearns and Ron (1999) derived what they called sanity-check bounds. In particular, they proved that an algorithm having a search space of finite VC-dimension is stable, in the sense that its stability (in a sense to be defined later) is bounded by its VC-dimension. Thus, using the stability as a complexity measure does not give worse bounds than using the VC-dimension.
The work presented here follows and extends the stability approach of Lugosi and Pawlak (1994), in that we derive exponential upper bounds on the generalization error based on notions of stability. It is based on earlier results presented in Bousquet and Elisseeff (2001). We consider both the leave-one-out error and the empirical error as possible estimates of the generalization error. We prove stability bounds for a large class of algorithms, which includes the Support Vector Machines, both in the regression and in the classification cases. Also, we generalize some earlier results from Devroye and Wagner.
2. Preliminaries

We first introduce some notation and then the main tools we will use to derive inequalities.

2.1 Notations

$X$ and $Y \subset \mathbb{R}$ being respectively an input and an output space, we consider a training set
$$S = \{z_1 = (x_1, y_1), \ldots, z_m = (x_m, y_m)\}$$
of size $m$ in $Z = X \times Y$, drawn i.i.d. from an unknown distribution $D$. A learning algorithm is a function $A$ from $Z^m$ into $F \subset Y^X$ which maps a learning set $S$ onto a function $A_S$ from $X$ to $Y$. To avoid complex notation, we consider only deterministic algorithms. It is also assumed that the algorithm $A$ is symmetric with respect to $S$, i.e. it does not depend on the order of the elements in the training set. Furthermore, we assume that all functions are measurable and all sets are countable, which does not limit the interest of the results presented here.
Given a training set $S$ of size $m$, we will build, for all $i = 1, \ldots, m$, modified training sets as follows:
By removing the $i$-th element:
$$S^{\setminus i} = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m\}.$$
By replacing the $i$-th element:
$$S^{i} = \{z_1, \ldots, z_{i-1}, z'_i, z_{i+1}, \ldots, z_m\},$$
where the replacement example $z'_i$ is assumed to be drawn from $D$ and is independent from $S$.
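These two operations are easy to mirror in code. Below is a minimal sketch (the helper names are ours, not from the paper) that builds $S^{\setminus i}$ and $S^i$ from a training set stored as a list of (x, y) pairs:

```python
def remove_ith(S, i):
    """Build S^{\\i}: the training set with the i-th example removed."""
    return S[:i] + S[i + 1:]

def replace_ith(S, i, z_prime):
    """Build S^i: the training set with the i-th example replaced by z_prime."""
    return S[:i] + [z_prime] + S[i + 1:]

# Toy training set of size m = 4, each z_j = (x_j, y_j).
S = [(0.0, 1.0), (1.0, 0.0), (2.0, 1.0), (3.0, 0.0)]
S_minus = remove_ith(S, 1)              # size m - 1, z_1 dropped
S_repl = replace_ith(S, 1, (1.5, 1.0))  # size m, z_1 swapped for a fresh draw
```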
Unless they are clear from context, the random variables over which we take probabilities and expectations will be specified in subscript. We thus introduce the notation $P_S[\cdot]$ and $E_S[\cdot]$ to denote respectively the probability and the expectation with respect to the random draw of the sample $S$ of size $m$ (drawn according to $D^m$). Similarly, $P_z[\cdot]$ and $E_z[\cdot]$ will denote the probability and expectation when $z$ is sampled according to $D$.
In order to measure the accuracy of the predictions of the algorithm, we will use a cost function $c : Y \times Y \to \mathbb{R}_+$. The loss of an hypothesis $f$ with respect to an example $z = (x, y)$ is then defined as
$$\ell(f, z) = c(f(x), y).$$
We will consider several measures of the performance of an algorithm. The main quantity we are interested in is the risk or generalization error. This is a random variable depending on the training set $S$, and it is defined as
$$R(A, S) = E_z[\ell(A_S, z)].$$
Unfortunately, $R$ cannot be computed since $D$ is unknown. We thus have to estimate it from the available data $S$. We will consider several estimators for this quantity.
The simplest estimator is the so-called empirical error (also known as resubstitution estimate), defined as
$$R_{emp}(A, S) = \frac{1}{m}\sum_{i=1}^{m} \ell(A_S, z_i).$$
Another classical estimator is the leave-one-out error (also known as deleted estimate), defined as
$$R_{loo}(A, S) = \frac{1}{m}\sum_{i=1}^{m} \ell(A_{S^{\setminus i}}, z_i).$$
When the algorithm is clear from context, we will simply write $R(S)$, $R_{emp}(S)$ and $R_{loo}(S)$. We will often simplify the notations further when the training sample is clear from context. In particular, we will use the shorthand notations $R = R(A, S)$, $R_{emp} = R_{emp}(A, S)$, and $R_{loo} = R_{loo}(A, S)$.
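For concreteness, both estimators can be computed directly from their definitions. The sketch below is ours (the paper fixes no particular algorithm here): the toy algorithm $A_S$ predicts the mean training label regardless of $x$, and the squared loss plays the role of the cost function $c$:

```python
def train_mean(S):
    """A toy learning algorithm: A_S(x) = mean of the training labels."""
    mean_y = sum(y for _, y in S) / len(S)
    return lambda x: mean_y

def sq_loss(f, z):
    x, y = z
    return (f(x) - y) ** 2     # cost c(f(x), y) = (f(x) - y)^2

def r_emp(train, S, loss):
    """Empirical (resubstitution) error: train on S, average loss on S."""
    f = train(S)
    return sum(loss(f, z) for z in S) / len(S)

def r_loo(train, S, loss):
    """Leave-one-out (deleted) error: for each i, train on S^{\\i}, test on z_i."""
    m = len(S)
    total = 0.0
    for i in range(m):
        f_i = train(S[:i] + S[i + 1:])
        total += loss(f_i, S[i])
    return total / m

S = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0), (3.0, 1.0)]
# The mean label is 0.5, so R_emp = 0.25; each S^{\i} has mean 1/3 or 2/3,
# so every leave-one-out loss is 4/9 and R_loo = 4/9.
```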
2.2 Main Tools
The study we describe here intends to bound the difference between empirical and generalization error for specific algorithms. For any $\epsilon > 0$, our goal is to bound the term
$$P_S\left[|R_{emp}(A, S) - R(A, S)| \geq \epsilon\right], \qquad (1)$$
which differs from what is usually studied in learning theory:
$$P_S\left[\sup_{f \in F} |R_{emp}(f) - R(f)| \geq \epsilon\right]. \qquad (2)$$
Indeed, we do not want to have a bound that holds uniformly over the whole space of possible functions, since we are interested in algorithms that may not explore it. Moreover, we may not even have a way to describe this space and assess its size. This explains why we want to focus on (1).
Our approach is based on inequalities that relate moments of multi-dimensional random functions to their first order finite differences. The first one is due to Steele (1986) and provides bounds for the variance. The second one is a version of Azuma's inequality due to McDiarmid (1989); it provides exponential bounds, but its assumptions are more restrictive.
Theorem 1 (Steele, 1986) Let $S$ and $S^i$ be defined as above. Let $F : Z^m \to \mathbb{R}$ be any measurable function. Then
$$E_S\left[\left(F(S) - E_S[F(S)]\right)^2\right] \leq \frac{1}{2}\sum_{i=1}^{m} E_{S, z'_i}\left[\left(F(S) - F(S^i)\right)^2\right].$$
Theorem 2 (McDiarmid, 1989) Let $S$ and $S^i$ be defined as above. Let $F : Z^m \to \mathbb{R}$ be any measurable function for which there exist constants $c_i$ ($i = 1, \ldots, m$) such that
$$\sup_{S \in Z^m,\, z'_i \in Z} |F(S) - F(S^i)| \leq c_i,$$
then
$$P_S\left[F(S) - E_S[F(S)] \geq \epsilon\right] \leq e^{-2\epsilon^2 / \sum_{i=1}^{m} c_i^2}.$$
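As a quick illustration (ours, not from the paper), take $F(S)$ to be the empirical mean of $m$ variables $z_j \in [0, 1]$: changing one coordinate moves $F$ by at most $c_i = 1/m$, so Theorem 2 yields the Hoeffding-type bound $\exp(-2m\epsilon^2)$. A small simulation confirms that the observed deviation frequency stays below it:

```python
import math
import random

def mcdiarmid_bound(eps, c):
    """Right-hand side of Theorem 2: exp(-2*eps^2 / sum_i c_i^2)."""
    return math.exp(-2 * eps ** 2 / sum(ci ** 2 for ci in c))

random.seed(0)
m, eps, trials = 100, 0.1, 2000
c = [1.0 / m] * m        # bounded differences of the empirical mean
deviations = 0
for _ in range(trials):
    S = [random.random() for _ in range(m)]  # z_j ~ U[0,1], so E[F(S)] = 0.5
    if sum(S) / m - 0.5 >= eps:
        deviations += 1
freq = deviations / trials
bound = mcdiarmid_bound(eps, c)  # exp(-2) ~ 0.135
```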
3. Defining the Stability of a Learning Algorithm

There are many ways to define and quantify the stability of a learning algorithm. The natural way of making such a definition is to start from the goal: we want to get bounds on the generalization error of specific learning algorithms, and we want these bounds to be tight when the algorithm satisfies the stability criterion.
As one may expect, the more restrictive a stability criterion is, the tighter the corresponding bound will be.
In the learning model we consider, the randomness comes from the sampling of the training set. We will thus consider stability with respect to changes in the training set. Moreover, we need an easy-to-check criterion, so we will consider only restricted changes such as the removal or the replacement of one single example in the training set.
Although not explicitly mentioned in their work, the first such notion was used by Devroye and Wagner (1979a) in order to get bounds on the variance of the error of local learning algorithms. Later, Kearns and Ron (1999) stated it as a definition and gave it a name. We give here a slightly modified version of Kearns and Ron's definition that suits our needs.
Definition 3 (Hypothesis Stability) An algorithm $A$ has hypothesis stability $\beta$ with respect to the loss function $\ell$ if the following holds:
$$\forall i \in \{1, \ldots, m\}, \quad E_{S, z}\left[|\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)|\right] \leq \beta. \qquad (3)$$
Note that this is the $L_1$ norm with respect to $D$, so that we can rewrite the above as
$$E_S\left[\|\ell(A_S, \cdot) - \ell(A_{S^{\setminus i}}, \cdot)\|_1\right] \leq \beta.$$
We will also use a variant of the above definition in which, instead of measuring the average change, we measure the change at one of the training points.

Definition 4 (Pointwise Hypothesis Stability) An algorithm $A$ has pointwise hypothesis stability $\beta$ with respect to the loss function $\ell$ if the following holds:
$$\forall i \in \{1, \ldots, m\}, \quad E_S\left[|\ell(A_S, z_i) - \ell(A_{S^{\setminus i}}, z_i)|\right] \leq \beta. \qquad (4)$$
Another, weaker, notion of stability was introduced by Kearns and Ron. It consists of measuring the change in the expected error of the algorithm instead of the average pointwise change.

Definition 5 (Error Stability) An algorithm $A$ has error stability $\beta$ with respect to the loss function $\ell$ if the following holds:
$$\forall S \in Z^m, \forall i \in \{1, \ldots, m\}, \quad \left|E_z[\ell(A_S, z)] - E_z[\ell(A_{S^{\setminus i}}, z)]\right| \leq \beta, \qquad (5)$$
which can also be written
$$\forall S \in Z^m, \forall i \in \{1, \ldots, m\}, \quad |R(S) - R(S^{\setminus i})| \leq \beta. \qquad (6)$$
Finally, we introduce a stronger notion of stability, which will allow us to get tight bounds. Moreover, we will show that it can be applied to large classes of algorithms.

Definition 6 (Uniform Stability) An algorithm $A$ has uniform stability $\beta$ with respect to the loss function $\ell$ if the following holds:
$$\forall S \in Z^m, \forall i \in \{1, \ldots, m\}, \quad \|\ell(A_S, \cdot) - \ell(A_{S^{\setminus i}}, \cdot)\|_\infty \leq \beta. \qquad (7)$$
Notice that (3) implies (5), and (7) implies (3), so that uniform stability is the strongest notion.
Considered as a function of $m$, the term $\beta$ will sometimes be denoted by $\beta_m$. We will say that an algorithm is stable when the value of $\beta_m$ decreases as $\frac{1}{m}$. An algorithm with uniform stability $\beta$ also has the following property:
$$\forall S, \forall z'_i, \quad |\ell(A_S, z) - \ell(A_{S^i}, z)| \leq |\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)| + |\ell(A_{S^i}, z) - \ell(A_{S^{\setminus i}}, z)| \leq 2\beta.$$
In other words, stability with respect to the exclusion of one point implies stability with respect to changes of one point.
We will assume further that, as a function of the sample size, the stability is non-increasing. This will be the case in all our examples. This assumption is not restrictive, since its only purpose is to simplify the statement of the theorems (we will always upper bound $\beta_{m-1}$ by $\beta_m$).
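These definitions can also be probed numerically. The following sketch (an illustrative experiment of ours, not an argument from the paper) trains one-dimensional ridge regression on $S$ and on every $S^{\setminus i}$, and records the largest change of the absolute loss over a grid of test points, giving a crude empirical lower estimate of the uniform stability $\beta$ of Definition 6:

```python
import random

def ridge_1d(S, lam):
    """A_S for 1-D ridge regression: the w minimizing
    (1/m) * sum((w*x - y)^2) + lam * w^2, in closed form."""
    m = len(S)
    num = sum(x * y for x, y in S)
    den = sum(x * x for x, _ in S) + lam * m
    return num / den           # A_S(x) = w * x

def loss(w, z):
    x, y = z
    return abs(w * x - y)      # absolute loss as the cost function

random.seed(1)
m, lam = 50, 1.0
S = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(m)]
grid = [(-1.0 + 0.1 * t, 0.0) for t in range(21)]   # test points z = (x, y)

w_full = ridge_1d(S, lam)
beta_hat = 0.0
for i in range(m):
    w_i = ridge_1d(S[:i] + S[i + 1:], lam)          # trained on S^{\i}
    for z in grid:
        beta_hat = max(beta_hat, abs(loss(w_full, z) - loss(w_i, z)))
# beta_hat lower-bounds the true uniform stability of this algorithm
```

Such a probe can only certify instability (a large beta_hat), not stability; the analytical bounds of Section 5 are needed for the latter.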
4. Generalization Bounds for Stable Learning Algorithms

We start this section by introducing a useful lemma about the bias of the estimators we study.

Lemma 7 For any symmetric learning algorithm $A$, we have, $\forall i \in \{1, \ldots, m\}$:
$$E_S[R(A, S) - R_{emp}(A, S)] = E_{S, z'_i}\left[\ell(A_S, z'_i) - \ell(A_{S^i}, z'_i)\right],$$
and
$$E_S\left[R(A, S^{\setminus i}) - R_{loo}(A, S)\right] = 0,$$
and
$$E_S[R(A, S) - R_{loo}(A, S)] = E_{S, z}\left[\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)\right].$$
Proof For the first equality, we just need to compute the expectation of $R_{emp}(A, S)$. We have
$$E_S[R_{emp}(S)] = \frac{1}{m}\sum_{j=1}^{m} E_S[\ell(A_S, z_j)] = \frac{1}{m}\sum_{j=1}^{m} E_{S, z'_i}[\ell(A_S, z_j)],$$
and, renaming $z_j$ as $z'_i$, we get, $\forall i \in \{1, \ldots, m\}$,
$$E_S[R_{emp}(S)] = E_{S, z'_i}[\ell(A_{S^i}, z'_i)],$$
by the i.i.d. and the symmetry assumptions. This proves the first equality. Similarly, we have
$$E_S[R_{loo}(S)] = \frac{1}{m}\sum_{i=1}^{m} E_S[\ell(A_{S^{\setminus i}}, z_i)] = \frac{1}{m}\sum_{i=1}^{m} E_{S, z}[\ell(A_{S^{\setminus i}}, z)],$$
from which we deduce the second and third equalities.
Remark 8 We notice from the above lemma (comparing the first and last equalities) that the empirical error and the leave-one-out error differ from the true error in a similar way. It is usually accepted that the empirical error is very much optimistically biased, while the leave-one-out error is almost unbiased (due to the second equation of the lemma). However, we will see that, for the particular algorithms we have in mind (which display high stability), the two estimators are very close to each other. The similarity of the bounds we will derive for both estimators will be striking. This can be explained intuitively by the fact that we are considering algorithms that do not directly minimize the empirical error but rather a regularized version of it, so that the bias in the empirical error will be reduced.
4.1 Polynomial Bounds with Hypothesis Stability

In this section we generalize a lemma from Devroye and Wagner (1979b). Their approach consists in bounding the second order moment of the estimators with the hypothesis stability of the algorithm. For this purpose, one could simply use Theorem 1. However, this theorem gives a bound on the variance, and we need here the second order moment of the difference between the error (leave-one-out or empirical) and the generalization error. It turns out that a direct study of this quantity leads to better constants than the use of Theorem 1.
Lemma 9 For any learning algorithm $A$ and loss function $\ell$ such that $0 \leq c(y, y') \leq M$, we have, for any $i, j \in \{1, \ldots, m\}$, $i \neq j$, for the empirical error,
$$E_S\left[(R - R_{emp})^2\right] \leq \frac{M^2}{2m} + 3M\, E_{S, z'_i}\left[|\ell(A_S, z_i) - \ell(A_{S^i}, z_i)|\right], \qquad (8)$$
and
$$E_S\left[(R - R_{emp})^2\right] \leq \frac{M^2}{2m} + M\, E_{S, z'_i, z}\left[|\ell(A_S, z) - \ell(A_{S^i}, z)|\right] + M\, E_{S, z'_i}\left[|\ell(A_S, z_j) - \ell(A_{S^i}, z_j)|\right] + M\, E_{S, z'_i}\left[|\ell(A_S, z_i) - \ell(A_{S^i}, z_i)|\right],$$
and, for the leave-one-out error,
$$E_S\left[(R - R_{loo})^2\right] \leq \frac{M^2}{2m} + 3M\, E_{S, z}\left[|\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)|\right], \qquad (9)$$
and
$$E_S\left[(R - R_{loo})^2\right] \leq \frac{M^2}{2m} + 2M\, E_{S, z'_i, z}\left[|\ell(A_S, z) - \ell(A_{S^i}, z)| + |\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)|\right]. \qquad (10)$$
The proof of this lemma is given in the appendix.

Remark 10 Notice that Devroye and Wagner's work focused on the leave-one-out estimator and on classification. We extend it to regression and to the empirical estimator, which they treated with the following easy-to-prove inequality:
$$E_S\left[(R - R_{emp})^2\right] \leq 2E_S\left[(R - R_{loo})^2\right] + 2M\, E_S\left[|\ell(A_S, z_i) - \ell(A_{S^{\setminus i}}, z_i)|\right],$$
which gives a similar result but with worse constants.
Let's try to explain the various quantities that appear in the upper bounds of the above lemma. We notice that the term $\frac{M^2}{2m}$ is always present: it cannot be avoided, even for a very stable algorithm, and somehow corresponds to the bias of the estimator. In Inequality (8), the expectation in the right-hand side corresponds to the following situation: starting from training set $S$, we measure the error at point $z_i \in S$; then we replace $z_i \in S$ by $z'_i$ and we again measure the error at $z_i$, which is no longer in the training set. Then, in the second inequality of Lemma 9, several different quantities appear. They all correspond to comparing the algorithm trained on $S$ and on $S^i$ (where $z_i$ is replaced by $z'_i$), but the comparison point differs: it is either $z$ (a point which is not part of the training set), or $z_j$ (a point of the training set different from $z_i$), or finally $z_i$.
For the leave-one-out error, in (9) we consider the average difference in error when trained on $S$ and on $S^{\setminus i}$ (where $z_i$ has been removed), and in (10) the first expectation in the right-hand side corresponds to the average difference in error when one point is changed, while the second one is the average difference in error when one point is removed.
All these quantities capture a certain aspect of the stability of the algorithm. In order to use the lemma, we need to bound them for specific algorithms. Instead of using all these different quantities, we will rather focus on the few notions of stability we introduced and see how they are related. We will see later how they can be computed (or upper bounded) in particular cases.
Now that we have a bound on the expected squared deviation of the estimator from the true error, the next step is to use Chebyshev's inequality in order to get a bound on the deviation which holds with high probability.
Theorem 11 For any learning algorithm $A$ with hypothesis stability $\beta_1$ and pointwise hypothesis stability $\beta_2$ with respect to a loss function $\ell$ such that $0 \leq c(y, y') \leq M$, we have, with probability $1 - \delta$,
$$R(A, S) \leq R_{emp}(A, S) + \sqrt{\frac{M^2 + 12Mm\beta_2}{2m\delta}},$$
and
$$R(A, S) \leq R_{loo}(A, S) + \sqrt{\frac{M^2 + 6Mm\beta_1}{2m\delta}}.$$
Proof First notice that, for all $S$ and all $z$,
$$|\ell(A_S, z) - \ell(A_{S^i}, z)| \leq |\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)| + |\ell(A_{S^{\setminus i}}, z) - \ell(A_{S^i}, z)|,$$
so that we get
$$E_{S, z'_i}\left[|\ell(A_S, z_i) - \ell(A_{S^i}, z_i)|\right] \leq E_S\left[|\ell(A_S, z_i) - \ell(A_{S^{\setminus i}}, z_i)|\right] + E_{S, z'_i}\left[|\ell(A_{S^{\setminus i}}, z_i) - \ell(A_{S^i}, z_i)|\right] \leq 2\beta_2.$$
We thus get, by (8),
$$E_S\left[(R - R_{emp})^2\right] \leq \frac{M^2}{2m} + 6M\beta_2.$$
Also, we have, by (9),
$$E_S\left[(R - R_{loo})^2\right] \leq \frac{M^2}{2m} + 3M\beta_1.$$
Now recall that Chebyshev's inequality gives, for a random variable $X$,
$$P[X \geq \epsilon] \leq \frac{E[X^2]}{\epsilon^2},$$
which in turn gives that, for all $\delta > 0$, with probability at least $1 - \delta$,
$$X \leq \sqrt{\frac{E[X^2]}{\delta}}.$$
Applying this to $R - R_{emp}$ and $R - R_{loo}$ respectively gives the result.

As pointed out earlier, there is a striking similarity between the above bounds, which seems to support the fact that, for a stable algorithm, the two estimators we are considering have a closely related behavior.
In the next section, we will see how to use the exponential inequality of Theorem 2 to get better bounds.
4.2 Exponential Bounds with Uniform Stability

Devroye and Wagner (1979a) first proved exponential bounds for k-local algorithms. However, the question of whether their technique can be extended to more general classes of algorithms is a topic for further research.
In Devroye et al. (1996), another, more general, technique is introduced, which relies on concentration inequalities. Inspired by this approach, we will derive exponential bounds for algorithms based on their uniform stability.
We will study separately the regression and the classification cases, for reasons that will be made clear.

4.2.1 Regression Case

A stable algorithm has the property that removing one element from its learning set does not change much of its outcome. As a consequence, the difference between empirical and generalization error, if thought of as a random variable, should have a small variance. If its expectation is small, stable algorithms should then be good candidates for their empirical error to be close to their generalization error. This assertion is formulated in the following theorem:
Theorem 12 Let $A$ be an algorithm with uniform stability $\beta$ with respect to a loss function $\ell$ such that $0 \leq \ell(A_S, z) \leq M$, for all $z \in Z$ and all sets $S$. Then, for any $m \geq 1$ and any $\delta \in (0, 1)$, the following bounds hold (separately) with probability at least $1 - \delta$ over the random draw of the sample $S$:
$$R \leq R_{emp} + 2\beta + (4m\beta + M)\sqrt{\frac{\ln(1/\delta)}{2m}}, \qquad (11)$$
and
$$R \leq R_{loo} + \beta + (4m\beta + M)\sqrt{\frac{\ln(1/\delta)}{2m}}. \qquad (12)$$
Remark 13 This theorem gives tight bounds when the stability $\beta$ scales as $1/m$. We will prove that this is the case for several known algorithms in later sections.
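To get a feel for (11), one can evaluate its right-hand side numerically. The sketch below (with illustrative values of our choosing) uses $\beta = \kappa/m$ as in the remark; the deviation term then shrinks roughly like $1/\sqrt{m}$, since the factor $4m\beta + M$ stays bounded:

```python
import math

def bound_11(r_emp, beta, m, M, delta):
    """Right-hand side of (11):
    R <= R_emp + 2*beta + (4*m*beta + M) * sqrt(ln(1/delta) / (2*m))."""
    return r_emp + 2 * beta + (4 * m * beta + M) * math.sqrt(
        math.log(1.0 / delta) / (2 * m))

kappa, M, delta = 1.0, 1.0, 0.05
gaps = []
for m in (100, 1000, 10000):
    beta = kappa / m      # uniform stability decreasing as 1/m
    gaps.append(bound_11(0.0, beta, m, M, delta))
# gaps is decreasing: the guarantee tightens as the sample grows
```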
Proof Let's prove that the conditions of Theorem 2 are verified by the random variables of interest. First, we study how these variables change when one training example is removed. We have
$$|R - R^{\setminus i}| \leq E_z\left[|\ell(A_S, z) - \ell(A_{S^{\setminus i}}, z)|\right] \leq \beta, \qquad (13)$$
and
$$|R_{emp} - R_{emp}^{\setminus i}| \leq \frac{1}{m}\sum_{j \neq i} |\ell(A_S, z_j) - \ell(A_{S^{\setminus i}}, z_j)| + \frac{1}{m}|\ell(A_S, z_i)| \leq \beta + \frac{M}{m}.$$
Then we upper bound the variation when one training example is changed:
$$|R - R^i| \leq |R - R^{\setminus i}| + |R^{\setminus i} - R^i| \leq 2\beta.$$
Similarly, we can write
$$|R_{emp} - R_{emp}^i| \leq |R_{emp} - R_{emp}^{\setminus i}| + |R_{emp}^{\setminus i} - R_{emp}^i| \leq 2\beta + 2\frac{M}{m};$$
however, a closer look reveals that the second factor of 2 is not needed. Indeed, we have
$$|R_{emp} - R_{emp}^i| \leq \frac{1}{m}\sum_{j \neq i} |\ell(A_S, z_j) - \ell(A_{S^i}, z_j)| + \frac{1}{m}|\ell(A_S, z_i) - \ell(A_{S^i}, z'_i)|$$
$$\leq \frac{1}{m}\sum_{j \neq i} |\ell(A_S, z_j) - \ell(A_{S^{\setminus i}}, z_j)| + \frac{1}{m}\sum_{j \neq i} |\ell(A_{S^{\setminus i}}, z_j) - \ell(A_{S^i}, z_j)| + \frac{1}{m}|\ell(A_S, z_i) - \ell(A_{S^i}, z'_i)|$$
$$\leq 2\beta + \frac{M}{m}.$$
Thus the random variable $R - R_{emp}$ satisfies the conditions of Theorem 2 with $c_i = 4\beta + \frac{M}{m}$.
It thus remains to bound the expectation of this random variable, which can be done using Lemma 7 and the $\beta$-stability property:
$$E_S[R - R_{emp}] \leq E_{S, z'_i}\left[|\ell(A_S, z'_i) - \ell(A_{S^i}, z'_i)|\right] \leq E_{S, z'_i}\left[|\ell(A_{S^i}, z'_i) - \ell(A_{S^{\setminus i}}, z'_i)|\right] + E_{S, z'_i}\left[|\ell(A_{S^{\setminus i}}, z'_i) - \ell(A_S, z'_i)|\right] \leq 2\beta,$$
which yields
$$P_S\left[R - R_{emp} \geq \epsilon + 2\beta_m\right] \leq \exp\left(-\frac{2m\epsilon^2}{(4m\beta_m + M)^2}\right).$$
Thus, setting the right-hand side to $\delta$, we obtain that, with probability at least $1 - \delta$,
$$R \leq R_{emp} + 2\beta_m + (4m\beta_m + M)\sqrt{\frac{\ln(1/\delta)}{2m}},$$
and thus
$$R \leq R_{emp} + 2\beta_m\left(1 + \sqrt{2m\ln(1/\delta)}\right) + M\sqrt{\frac{\ln(1/\delta)}{2m}},$$
which gives (11).
For the leave-one-out error, we proceed similarly. We have
$$|R_{loo} - R_{loo}^{\setminus i}| \leq \frac{1}{m}\sum_{j \neq i} |\ell(A_{S^{\setminus j}}, z_j) - \ell(A_{S^{\setminus i, j}}, z_j)| + \frac{1}{m}|\ell(A_{S^{\setminus i}}, z_i)| \leq \beta_{m-1} + \frac{M}{m},$$
and also
$$|R_{loo} - R_{loo}^i| \leq 2\beta_{m-1} + \frac{M}{m} \leq 2\beta_m + \frac{M}{m}.$$
So that Theorem 2 can be applied to $R - R_{loo}$ with $c_i = 4\beta_m + \frac{M}{m}$. Then we use Lemma 7 along with (13) to deduce
$$P_S\left[R - R_{loo} \geq \epsilon + \beta_m\right] \leq \exp\left(-\frac{2m\epsilon^2}{(4m\beta_m + M)^2}\right),$$
which gives (12) by setting the right-hand side to $\delta$.
Once again, we notice that the bounds for the empirical error and for the leave-one-out error are very similar. As we will see in later sections, this clearly indicates that our method is not at all suited to the analysis of algorithms which simply perform the minimization of the empirical error, since these are not stable in the sense defined above.
4.2.2 Classification Case

In this section, we consider the case where $Y = \{-1, 1\}$ and the algorithm $A$ returns a function $A_S$ that maps instances in $X$ to labels in $\{-1, 1\}$. The cost function is then simply
$$c(A_S(x), y) = 1_{\{y A_S(x) \leq 0\}}.$$
Thus we see that, because of the discrete nature of the cost function, the uniform stability of an algorithm with respect to such a cost function can only be $\beta = 0$ or $\beta = 1$. In the first case, it means that the algorithm always returns the same function. In the second case, there is no hope of obtaining interesting bounds, since we saw that we need $\beta = O\left(\frac{1}{m}\right)$ for our bounds to give interesting results.
We thus have to proceed in a different way. One possible approach is to modify our error estimates so that they become "smoother" and have higher stability. The idea of smoothing error estimators to decrease their variance is not new, and it has even been used in conjunction with McDiarmid's inequality by Lugosi and Pawlak (1994) in order to derive error bounds for certain algorithms. Lugosi and Pawlak studied algorithms which produce estimates for the distributions $P[X|Y = -1]$ and $P[X|Y = +1]$, and defined analogues of the resubstitution and leave-one-out estimates of the error suited to these algorithms.
Here we will take a related, though slightly different, route. Indeed, we will consider algorithms having a real-valued output. However, we do not require this output to correspond to a posterior probability; it should simply have the correct sign. That is, the label predicted by such an algorithm is the sign of its real-valued output. Of course, a good algorithm will produce outputs whose absolute value somehow represents the confidence it has in the prediction.
In order to apply the results obtained so far to this setting, we need to introduce some definitions.

Definition 14 A real-valued classification algorithm $A$ is a learning algorithm that maps training sets $S$ to functions $A_S : X \to \mathbb{R}$, such that the label predicted on an instance $x$ is the sign of $A_S(x)$.

This class of algorithms includes, for instance, the classifiers produced by SVM or by ensemble methods such as boosting.
Notice that the cost function defined above extends to the case where the first argument is a real number and has the desired properties: it is zero when the algorithm predicts the right label and 1 otherwise.
Definition 15 (Classification Stability) A real-valued classification algorithm $A$ has classification stability $\beta$ if the following holds:
$$\forall S \in Z^m, \forall i \in \{1, \ldots, m\}, \quad \|A_S(\cdot) - A_{S^{\setminus i}}(\cdot)\|_\infty \leq \beta. \qquad (14)$$
We introduce a modified cost function:
$$c_\gamma(y, y') = \begin{cases} 1 & \text{for } yy' \leq 0, \\ 1 - yy'/\gamma & \text{for } 0 \leq yy' \leq \gamma, \\ 0 & \text{for } yy' \geq \gamma, \end{cases}$$
and we denote
$$\ell_\gamma(f, z) = c_\gamma(f(x), y).$$
Accordingly, we define the following error estimates:
$$R_{emp}^\gamma(A, S) = \frac{1}{m}\sum_{i=1}^{m} \ell_\gamma(A_S, z_i),$$
and similarly
$$R_{loo}^\gamma(A, S) = \frac{1}{m}\sum_{i=1}^{m} \ell_\gamma(A_{S^{\setminus i}}, z_i).$$
The loss $\ell_\gamma$ will count an error each time the function $f$ gives an output close to zero, the closeness being controlled by $\gamma$.
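The modified cost and the associated empirical estimate translate directly into code. In this sketch (the function names are ours), f stands for any real-valued classifier $A_S$:

```python
def c_gamma(y_pred, y, gamma):
    """Modified cost c_gamma: 1 on wrong-sign outputs, a linear ramp on
    margins in [0, gamma], and 0 on margins beyond gamma."""
    margin = y * y_pred
    if margin <= 0:
        return 1.0
    if margin <= gamma:
        return 1.0 - margin / gamma
    return 0.0

def r_emp_gamma(f, S, gamma):
    """Empirical gamma-error: average of ell_gamma(f, z_i) over S."""
    return sum(c_gamma(f(x), y, gamma) for x, y in S) / len(S)

# Margins 2.0, 0.25 and 1.0: only the low-margin prediction is penalized.
S = [(2.0, 1), (0.25, 1), (-1.0, -1)]
f = lambda x: x                       # toy real-valued classifier
# r_emp_gamma(f, S, 0.5) averages [0, 0.5, 0], i.e. 1/6
```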
Lemma 16 A real-valued classification algorithm $A$ with classification stability $\beta$ has uniform stability $\frac{\beta}{\gamma}$ with respect to the loss function $\ell_\gamma$.

Proof It is easy to see that $c_\gamma$ is $\frac{1}{\gamma}$-Lipschitz with respect to its first argument, and so is $\ell_\gamma$ by definition. Thus we have, for all $i$, all training sets $S$, and all $z$,
$$|\ell_\gamma(A_S, z) - \ell_\gamma(A_{S^{\setminus i}}, z)| = |c_\gamma(A_S(x), y) - c_\gamma(A_{S^{\setminus i}}(x), y)| \leq \frac{1}{\gamma}|A_S(x) - A_{S^{\setminus i}}(x)| \leq \frac{\beta}{\gamma}.$$

We can thus apply Theorem 12 with the loss function $\ell_\gamma$ and get the following theorem.
Theorem 17 Let $A$ be a real-valued classification algorithm with stability $\beta$. Then, for all $\gamma > 0$, any $m \geq 1$, and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random draw of the sample $S$,
$$R \leq R_{emp}^\gamma + 2\frac{\beta}{\gamma} + \left(4m\frac{\beta}{\gamma} + 1\right)\sqrt{\frac{\ln(1/\delta)}{2m}}, \qquad (15)$$
and, with probability at least $1 - \delta$ over the random draw of the sample $S$,
$$R \leq R_{loo}^\gamma + \frac{\beta}{\gamma} + \left(4m\frac{\beta}{\gamma} + 1\right)\sqrt{\frac{\ln(1/\delta)}{2m}}. \qquad (16)$$
Proof We apply Theorem 12 to $A$ with the loss function $\ell_\gamma$, which is bounded by $M = 1$ and for which the algorithm is $\frac{\beta}{\gamma}$-stable. Moreover, we use the fact that $R(A_S) \leq R_\gamma = E_z[\ell_\gamma(A_S, z)]$.

In order to make this result more practically useful, we need a statement that would hold uniformly for all values of $\gamma$. The same techniques as in Bartlett (1996) lead to the following result:
Theorem 18 Let $A$ be a real-valued classification algorithm with stability $\beta$, and let $B$ be some real number. Then, for any $m \geq 1$ and any $\delta \in (0, 1)$, with probability at least $1 - \delta$ over the random draw of the sample $S$,
$$\forall \gamma \in (0, B], \quad R \leq R_{emp}^\gamma + 2\frac{e\beta}{\gamma} + \left(4m\frac{e\beta}{\gamma} + 1\right)\sqrt{\frac{1}{2m}}\left(\sqrt{\ln(1/\delta)} + \sqrt{2\ln\ln\frac{eB}{\gamma}}\right), \qquad (17)$$
and
$$\forall \gamma \in (0, B], \quad R \leq R_{loo}^\gamma + \frac{e\beta}{\gamma} + \left(4m\frac{e\beta}{\gamma} + 1\right)\sqrt{\frac{1}{2m}}\left(\sqrt{\ln(1/\delta)} + \sqrt{2\ln\ln\frac{eB}{\gamma}}\right). \qquad (18)$$

We defer the proof of this theorem to the appendix.
We can thus apply Theorem 18 with a value of $\gamma$ which is optimized after having seen the data.
5. Stable Learning Algorithms
As seen in the previous sections, our approach allows one to derive bounds on the generalization error from the empirical and leave-one-out errors, bounds which depend on the stability of the algorithm. However, we noticed that the bounds we obtain for the two estimators are very similar. This readily implies that the method is suited to the study of algorithms for which the empirical error is close to the leave-one-out error. There is thus no hope of getting good bounds for algorithms which simply minimize the empirical error, since their empirical error will be very much optimistically biased compared to their leave-one-out error.

This means that, in order to be stable in the sense defined above, a learning algorithm has to significantly depart from an empirical risk minimizer. It thus has to accept a significant number of training errors, which should however not be larger than the noise level. In order to generalize, these extra training errors will thus be compensated by a decrease of the complexity of the learned function.
In some sense, this is exactly what regularization-based algorithms do: they minimize an objective function which is the sum of an empirical error term and a regularization term which penalizes the complexity of the solution. This explains why our approach is particularly well suited for the analysis of such algorithms.
5.1 Previous Results for k-Local Rules
As an illustration of the various notions of stability, we will first study the case of k-Local Rules, for which a large number of results have been obtained.

A k-Local Rule is a classification algorithm that determines the label of an instance $x$ based on the $k$ closest instances in the training set. The simplest example of such a rule is the k-Nearest Neighbors (k-NN) algorithm, which computes the label by a majority vote among the labels of the $k$ nearest instances in the training set. Such an algorithm can be studied as a $\{0,1\}$-valued classifier, or as a $[0,1]$-valued classifier if we take into account the result of the vote.

We will consider the real-valued version of the k-NN classifier and give a result about its stability with respect to different loss functions.
1. With respect to the $\{0,1\}$-loss function, the k-NN classifier has hypothesis stability

$$\beta \le \frac{4}{m}\sqrt{\frac{k}{2\pi}}.$$

This was proven in Devroye and Wagner (1979a). We will not reproduce the proof, which is quite technical, but notice that a symmetry argument readily gives

$$\mathbb{P}\left[A_S(z) \ne A_{S^{\setminus i}}(z)\right] \le \frac{k}{m}.$$
2. With respect to the absolute loss function $c(y, y') = |y - y'|$, the k-NN classifier has only a trivial uniform stability, namely the bound on the values of $y$.
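The symmetry argument behind the $k/m$ bound can be checked empirically: removing $z_i$ can only change a k-NN prediction at points whose $k$ nearest neighbours include $z_i$, which happens for exactly $k$ of the $m$ choices of $i$. A small Python sketch with synthetic one-dimensional data (not from the paper) illustrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 100, 5

# Synthetic 1-D training sample with noisy labels (hypothetical data).
X = rng.uniform(-1, 1, size=m)
y = (X + 0.3 * rng.standard_normal(m) > 0).astype(int)

def knn_predict(Xtr, ytr, x, k):
    # Majority vote among the k nearest training points (ties broken as 1).
    idx = np.argsort(np.abs(Xtr - x))[:k]
    return int(2 * ytr[idx].sum() >= k)

# Average over i of the disagreement rate between A_S and A_{S\i},
# estimated on fresh test points.
Xte = rng.uniform(-1, 1, size=200)
full = np.array([knn_predict(X, y, x, k) for x in Xte])
flips = 0.0
for i in range(m):
    Xi, yi = np.delete(X, i), np.delete(y, i)
    loo = np.array([knn_predict(Xi, yi, x, k) for x in Xte])
    flips += np.mean(loo != full)
flip_rate = flips / m
print(flip_rate, k / m)  # flip_rate never exceeds k/m
```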
The polynomial bound that can be obtained from hypothesis stability suggests that $k$ should be small if one wants a good bound. This is somewhat counter-intuitive, since the decision seems more robust to noise when many points are involved in the vote. There exist exponential bounds on the leave-one-out estimate of k-NN for the $\{0,1\}$-loss, obtained by Devroye and Wagner (1979a), and for the smoothed error estimate (i.e. with respect to the absolute loss), obtained by Lugosi and Pawlak (1994), and these bounds do not depend on the parameter $k$, due to a more careful application of McDiarmid's inequality suited to the algorithm. We may then wonder, in that case, whether the polynomial bounds are interesting compared to exponential ones, since the latter are sharper and closer to the intuitive interpretation. Despite this example, we believe that, in general, polynomial bounds can give relevant hints about which features of a learning algorithm lead to good generalization.

In the remainder, we will consider several algorithms that have not been studied from a stability perspective, and we will focus on their uniform stability only, which turns out to be quite good. Obtaining results directly for their hypothesis stability remains an open problem.
5.2 Stability of Regularization Algorithms
Uniform stability may appear to be a strict condition. Actually, we will see in this section that many existing learning methods exhibit a uniform stability which is controlled by the regularization parameter and can thus be very small.
5.2.1 Stability for General Regularizers
Recall that $\ell(f, z) = c(f(x), y)$. We assume in this section that $\mathcal{F}$ is a convex subset of a linear space.

Definition 19 A loss function $\ell$ defined on $\mathcal{F}\times Y$ is $\sigma$-admissible with respect to $\mathcal{F}$ if the associated cost function $c$ is convex with respect to its first argument and the following condition holds:

$$\forall y_1, y_2 \in \mathcal{D},\ \forall y' \in Y,\quad |c(y_1, y') - c(y_2, y')| \le \sigma\,|y_1 - y_2|,$$

where $\mathcal{D} = \{y : \exists f \in \mathcal{F},\ \exists x \in X,\ f(x) = y\}$ is the domain of the first argument of $c$.

Thus, in the case of the quadratic loss for example, this condition is verified if $Y$ is bounded and $\mathcal{F}$ is totally bounded, that is, there exists $M < \infty$ such that

$$\forall f \in \mathcal{F},\ \|f\|_\infty \le M \quad\text{and}\quad \forall y \in Y,\ |y| \le M.$$
We introduce the objective function that the algorithm will minimize: let $N : \mathcal{F} \to \mathbb{R}_+$ be a functional on $\mathcal{F}$, and define

$$R_r(g) := \frac{1}{m}\sum_{j=1}^{m}\ell(g, z_j) + \lambda N(g), \qquad (19)$$

and a modified version based on a truncated training set,

$$R_r^{\setminus i}(g) := \frac{1}{m}\sum_{j\ne i}\ell(g, z_j) + \lambda N(g). \qquad (20)$$

Depending on the algorithm, $N$ will take different forms. To derive stability bounds, we need some general results about the minimizers of (19) and (20).
Lemma 20 Let $\ell$ be $\sigma$-admissible with respect to $\mathcal{F}$, and let $N$ be a functional defined on $\mathcal{F}$ such that, for all training sets $S$, $R_r$ and $R_r^{\setminus i}$ have a minimum (not necessarily unique) in $\mathcal{F}$. Let $f$ denote a minimizer in $\mathcal{F}$ of $R_r$, and for $i = 1,\ldots,m$, let $f^{\setminus i}$ denote a minimizer in $\mathcal{F}$ of $R_r^{\setminus i}$. We have, for any $t \in [0,1]$,

$$N(f) - N(f + t\,\Delta f) + N(f^{\setminus i}) - N(f^{\setminus i} - t\,\Delta f) \le \frac{t\sigma}{\lambda m}\,|\Delta f(x_i)|, \qquad (21)$$

where $\Delta f = f^{\setminus i} - f$.
Proof Let us introduce the notation

$$R_{emp}^{\setminus i}(f) := \frac{1}{m}\sum_{j\ne i}\ell(f, z_j).$$

Recall that a convex function $g$ verifies:

$$\forall x, y,\ \forall t \in [0,1],\quad g(x + t(y - x)) - g(x) \le t\,\big(g(y) - g(x)\big).$$

Since $c$ is convex, $R_{emp}^{\setminus i}$ is convex too, and thus, for all $t \in [0,1]$,

$$R_{emp}^{\setminus i}(f + t\,\Delta f) - R_{emp}^{\setminus i}(f) \le t\left(R_{emp}^{\setminus i}(f^{\setminus i}) - R_{emp}^{\setminus i}(f)\right).$$

We can also get, switching the roles of $f$ and $f^{\setminus i}$,

$$R_{emp}^{\setminus i}(f^{\setminus i} - t\,\Delta f) - R_{emp}^{\setminus i}(f^{\setminus i}) \le t\left(R_{emp}^{\setminus i}(f) - R_{emp}^{\setminus i}(f^{\setminus i})\right).$$

Summing the two preceding inequalities yields

$$R_{emp}^{\setminus i}(f + t\,\Delta f) - R_{emp}^{\setminus i}(f) + R_{emp}^{\setminus i}(f^{\setminus i} - t\,\Delta f) - R_{emp}^{\setminus i}(f^{\setminus i}) \le 0. \qquad (22)$$

Now, by assumption, we have

$$R_r(f) - R_r(f + t\,\Delta f) \le 0, \qquad (23)$$

$$R_r^{\setminus i}(f^{\setminus i}) - R_r^{\setminus i}(f^{\setminus i} - t\,\Delta f) \le 0, \qquad (24)$$

so that, summing the two previous inequalities and using (22), we get

$$c(f(x_i), y_i) - c\big((f + t\,\Delta f)(x_i), y_i\big) + \lambda m\left(N(f) - N(f + t\,\Delta f) + N(f^{\setminus i}) - N(f^{\setminus i} - t\,\Delta f)\right) \le 0,$$

and thus, by the $\sigma$-admissibility condition,

$$N(f) - N(f + t\,\Delta f) + N(f^{\setminus i}) - N(f^{\setminus i} - t\,\Delta f) \le \frac{t\sigma}{\lambda m}\,|\Delta f(x_i)|.$$
In the above lemma, there is no assumption about the space $\mathcal{F}$ (apart from being a convex subset of a linear space) nor about the regularizer $N$ (apart from the existence of minima for $R_r$ and $R_r^{\setminus i}$). However, most practical regularization-based algorithms work with a space $\mathcal{F}$ that is a vector space and with a convex regularizer. We will thus refine our previous result in this particular setting. In order to do this, we need some standard definitions about convex functions, which are deferred to Appendix C; most of this material can be found in Rockafellar (1970) and in Gordon (1999).
Lemma 21 Under the conditions of Lemma 20, when $\mathcal{F}$ is a vector space and $N$ is a proper closed convex function from $\mathcal{F}$ to $\mathbb{R}\cup\{-\infty, +\infty\}$, we have

$$d_N(f, f^{\setminus i}) + d_N(f^{\setminus i}, f) \le \frac{1}{\lambda m}\left(\ell(f^{\setminus i}, z_i) - \ell(f, z_i) - d_{\ell(\cdot, z_i)}(f^{\setminus i}, f)\right) \le \frac{\sigma}{\lambda m}\,|\Delta f(x_i)|,$$

when $N$ and $\ell$ are differentiable.
Proof We start with the differentiable case and work with regular divergences. By the definition of $f$ and $f^{\setminus i}$, using (30), we have

$$d_{R_r}(f^{\setminus i}, f) + d_{R_r^{\setminus i}}(f, f^{\setminus i}) \le R_r(f^{\setminus i}) - R_r(f) + R_r^{\setminus i}(f) - R_r^{\setminus i}(f^{\setminus i}) = \frac{1}{m}\ell(f^{\setminus i}, z_i) - \frac{1}{m}\ell(f, z_i).$$

Moreover, by the nonnegativity of divergences, we have

$$d_{R_{emp}^{\setminus i}}(f, f^{\setminus i}) + d_{R_{emp}^{\setminus i}}(f^{\setminus i}, f) \ge 0,$$

which, with the previous inequality and the fact that $d_{A+B} = d_A + d_B$, gives

$$\lambda\, d_N(f, f^{\setminus i}) + \lambda\, d_N(f^{\setminus i}, f) \le \frac{1}{m}\left(\ell(f^{\setminus i}, z_i) - \ell(f, z_i) - d_{\ell(\cdot, z_i)}(f^{\setminus i}, f)\right),$$

and we obtain the first part of the result. For the second part, we notice that

$$\ell(f^{\setminus i}, z_i) - \ell(f, z_i) - d_{\ell(\cdot, z_i)}(f^{\setminus i}, f) \le \ell(f^{\setminus i}, z_i) - \ell(f, z_i)$$

by the nonnegativity of the divergence, and thus

$$\ell(f^{\setminus i}, z_i) - \ell(f, z_i) - d_{\ell(\cdot, z_i)}(f^{\setminus i}, f) \le \sigma\left|f^{\setminus i}(x_i) - f(x_i)\right|$$

by the $\sigma$-admissibility condition.
The results in this section can be used to derive bounds on the stability of many learning algorithms. Every procedure that can be interpreted as the minimization of a regularized functional can be analyzed with these lemmas. The only things that change from one procedure to another are the regularizer $N$ and the cost function $c$. In the following, we show how to apply these results to different learning algorithms.
5.2.2 Application to Regularization in Hilbert Spaces
Many algorithms, such as Support Vector Machines (SVMs) or the classical regularization networks introduced by Poggio and Girosi (1990), perform the minimization of a regularized objective function where the regularizer is a norm in a reproducing kernel Hilbert space (RKHS):

$$N(f) = \|f\|_k^2,$$

where $k$ refers to the kernel (see e.g. Wahba, 2000, or Evgeniou et al., 1999, for definitions). The fundamental property of an RKHS $\mathcal{F}$ is the so-called reproducing property, which writes

$$\forall f \in \mathcal{F},\ \forall x \in X,\quad f(x) = \langle f, k(x, \cdot)\rangle.$$

In particular, this gives, by the Cauchy-Schwarz inequality,

$$\forall f \in \mathcal{F},\ \forall x \in X,\quad |f(x)| \le \|f\|_k\,\sqrt{k(x, x)}. \qquad (25)$$
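Inequality (25) is straightforward to verify numerically for a concrete kernel. The sketch below (an illustration, not from the paper) does so for a Gaussian kernel, for which $k(x,x) = 1$, and a random element of the span of kernel functions:

```python
import numpy as np

rng = np.random.default_rng(1)

def k(x, y, gamma=1.0):
    # Gaussian RBF kernel; k(x, x) = 1, so kappa = 1 here.
    return np.exp(-gamma * (x - y) ** 2)

# A random element f = sum_i alpha_i k(x_i, .) of the RKHS.
xs = rng.uniform(-2, 2, size=20)
alpha = rng.standard_normal(20)
K = k(xs[:, None], xs[None, :])
norm_f = np.sqrt(alpha @ K @ alpha)  # ||f||_k

# Check |f(x)| <= ||f||_k * sqrt(k(x, x)) on a grid of points.
grid = np.linspace(-3, 3, 200)
fvals = (alpha[:, None] * k(xs[:, None], grid[None, :])).sum(axis=0)
assert np.all(np.abs(fvals) <= norm_f * np.sqrt(k(grid, grid)) + 1e-9)
```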
We now state a result about the uniform stability of RKHS learning.

Theorem 22 Let $\mathcal{F}$ be a reproducing kernel Hilbert space with kernel $k$ such that $\forall x \in X,\ k(x,x) \le \kappa^2 < \infty$. Let $\ell$ be $\sigma$-admissible with respect to $\mathcal{F}$. The learning algorithm $A$ defined by

$$A_S = \arg\min_{g\in\mathcal{F}}\ \frac{1}{m}\sum_{i=1}^{m}\ell(g, z_i) + \lambda\|g\|_k^2 \qquad (26)$$

has uniform stability $\beta$ with respect to $\ell$, with

$$\beta \le \frac{\sigma^2\kappa^2}{2\lambda m}.$$
Proof We use the proof technique described in the previous section. It can easily be checked that, when $N(\cdot) = \|\cdot\|_k^2$, we have

$$d_N(g, g') = \|g - g'\|_k^2.$$

Thus Lemma 21 gives

$$2\|\Delta f\|_k^2 \le \frac{\sigma}{\lambda m}\,|\Delta f(x_i)|.$$

Using (25), we get

$$|\Delta f(x_i)| \le \|\Delta f\|_k\,\sqrt{k(x_i, x_i)} \le \kappa\,\|\Delta f\|_k,$$

so that

$$\|\Delta f\|_k \le \frac{\sigma\kappa}{2\lambda m}.$$

Now we have, by the $\sigma$-admissibility of $\ell$,

$$|\ell(f, z) - \ell(f^{\setminus i}, z)| \le \sigma\,|f(x) - f^{\setminus i}(x)| = \sigma\,|\Delta f(x)|,$$

which, using (25) again, gives the result.
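Theorem 22 can be illustrated with kernel ridge regression (the squared loss in (26)), whose minimizer is available in closed form. The sketch below, on synthetic data, compares the worst observed leave-one-out change of the loss against the bound $\sigma^2\kappa^2/(2\lambda m)$, with $\sigma$ obtained from Lemma 23 below; the data and all constants are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(2)
m, lam = 60, 0.5

def gram(a, b):
    # Gaussian kernel: k(x, x) = 1, hence kappa = 1.
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

X = rng.uniform(-2, 2, size=m)
y = np.clip(np.sin(X) + 0.1 * rng.standard_normal(m), 0.0, 1.0)  # Y = [0, 1]

def fit(Xtr, ytr, n):
    # Exact minimizer of (1/n)*sum of squared losses + lam*||g||_k^2, i.e.
    # alpha = (K + lam*n*I)^{-1} y; n stays equal to m for the leave-one-out
    # objective (20), which keeps the 1/m factor.
    Kt = gram(Xtr, Xtr)
    return np.linalg.solve(Kt + lam * n * np.eye(len(Xtr)), ytr)

grid = np.linspace(-3, 3, 121)
f_full = gram(grid, X) @ fit(X, y, m)

# Empirical uniform stability: worst change of the squared loss over
# z = (x, y') with y' in {0, 1} (the extremes of Y), removing one point.
beta_emp = 0.0
for i in range(m):
    Xi, yi = np.delete(X, i), np.delete(y, i)
    f_loo = gram(grid, Xi) @ fit(Xi, yi, m)
    for yv in (0.0, 1.0):
        beta_emp = max(beta_emp,
                       np.max(np.abs((f_full - yv) ** 2 - (f_loo - yv) ** 2)))

sigma = 2 * (1 / np.sqrt(lam) + 1)       # valid sigma from Lemma 23 (B(0) = 1)
beta_bound = sigma ** 2 / (2 * lam * m)  # Theorem 22 with kappa = 1
print(beta_emp, beta_bound)
```

The observed value sits well below the bound, as expected from the slack in the proof.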
We are now one step away from being able to apply Theorem 12. The only thing that we need is to bound the loss function. Indeed, the $\sigma$-admissibility condition does not ensure boundedness. However, since we are in an RKHS, we can use the following simple lemma, which ensures that if we have an a priori bound on the target values $y$, then the boundedness condition is satisfied.
Lemma 23 Let $A$ be the algorithm of Theorem 22, where $\ell$ is a loss function associated to a convex cost function $c(\cdot,\cdot)$. We denote by $B(\cdot)$ a positive non-decreasing real-valued function such that, for all $y \in \mathcal{D}$,

$$\forall y' \in Y,\quad c(y, y') \le B(y).$$

For any training set $S$, we have

$$\|f\|_k^2 \le \frac{B(0)}{\lambda},$$

and also

$$\forall z \in Z,\quad 0 \le \ell(A_S, z) \le B\!\left(\kappa\sqrt{\frac{B(0)}{\lambda}}\right).$$

Moreover, $\ell$ is $\sigma$-admissible, where $\sigma$ can be taken as

$$\sigma = \sup_{y'\in Y}\ \sup_{|y|\le \kappa\sqrt{B(0)/\lambda}}\ \left|\frac{\partial c}{\partial y}(y, y')\right|.$$
Proof We have, for $f = A_S$,

$$R_r(f) \le R_r(0) = \frac{1}{m}\sum_{i=1}^{m}\ell(0, z_i) \le B(0),$$

and also $R_r(f) \ge \lambda\|f\|_k^2$, which gives the first inequality. The second inequality follows from (25). The last one is a consequence of the definition of $\sigma$-admissibility.
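The norm bound of Lemma 23 can be checked on a closed-form example; the sketch below uses kernel ridge regression with targets in $[0,1]$, so that $B(0) = 1$ for the squared loss, with synthetic data standing in for a real training set:

```python
import numpy as np

rng = np.random.default_rng(8)
m, lam = 40, 0.3
x = rng.uniform(-2, 2, m)
y = rng.uniform(0, 1, m)                      # Y = [0, 1], so B(0) = 1 here
Km = np.exp(-(x[:, None] - x[None, :]) ** 2)  # Gaussian kernel, kappa = 1

# Kernel ridge regression: exact minimizer of (26) with the squared loss.
alpha = np.linalg.solve(Km + lam * m * np.eye(m), y)
norm_sq = alpha @ Km @ alpha                  # ||f||_k^2

# Lemma 23: lam * ||f||_k^2 <= R_r(f) <= R_r(0) <= B(0) = 1, and the
# function values at the data are bounded by kappa * ||f||_k.
fmax = np.max(np.abs(Km @ alpha))
print(norm_sq, 1 / lam, fmax)
```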
Example 1 (Stability of bounded SVM regression) Assume $k$ is a bounded kernel, that is, $k(x,x) \le \kappa^2 < \infty$, and $Y = [0, B]$. Consider the $\varepsilon$-insensitive loss function

$$\ell(f, z) = |f(x) - y|_\varepsilon := \begin{cases} 0 & \text{if } |f(x) - y| \le \varepsilon,\\ |f(x) - y| - \varepsilon & \text{otherwise.}\end{cases}$$

This function is 1-admissible, and we can take $B(y) = y + B$. The SVM algorithm for regression with a kernel $k$ can be defined as

$$A_S = \arg\min_{g\in\mathcal{F}}\ \frac{1}{m}\sum_{i=1}^{m}\ell(g, z_i) + \lambda\|g\|_k^2,$$

and we thus get the following stability bound:

$$\beta \le \frac{\kappa^2}{2\lambda m}.$$

Moreover, by Lemma 23, we have

$$\forall z \in Z,\quad 0 \le \ell(A_S, z) \le \kappa\sqrt{\frac{B}{\lambda}} + B.$$

Plugging the above into Theorem 12 gives the following bound, which holds with probability at least $1-\delta$:

$$R \le R_{emp} + \frac{\kappa^2}{\lambda m} + \left(\frac{2\kappa^2}{\lambda} + \kappa\sqrt{\frac{B}{\lambda}} + B\right)\sqrt{\frac{\ln(1/\delta)}{2m}}.$$

Note that we consider here an SVM without the bias $b$, which is, strictly speaking, different from the true definition of the SVM. The question of whether $b$ can be included in such a setting remains open.
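The 1-admissibility claimed for the $\varepsilon$-insensitive loss amounts to its being 1-Lipschitz in the prediction, which a quick numerical check (with arbitrary values) confirms:

```python
import numpy as np

def eps_loss(u, y, eps=0.5):
    # epsilon-insensitive loss |u - y|_eps used by SVM regression.
    return np.maximum(np.abs(u - y) - eps, 0.0)

rng = np.random.default_rng(3)
y1, y2 = rng.uniform(-5, 5, (2, 10000))  # two candidate predictions
yp = rng.uniform(0, 3, 10000)            # targets
# 1-admissibility: |c(y1, y') - c(y2, y')| <= |y1 - y2|
gap = np.abs(eps_loss(y1, yp) - eps_loss(y2, yp))
assert np.all(gap <= np.abs(y1 - y2) + 1e-12)
```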
Example 2 (Stability of soft margin SVM classification) We have $Y = \{-1, 1\}$. We consider the following loss function:

$$\ell(f, z) = (1 - yf(x))_+ = \begin{cases} 1 - yf(x) & \text{if } 1 - yf(x) > 0,\\ 0 & \text{otherwise,}\end{cases}$$

which is 1-admissible. From Lemma 20 we deduce that the real-valued classifier obtained by the SVM optimization procedure has classification stability $\beta$ with

$$\beta \le \frac{\kappa^2}{2\lambda m}.$$

We use Theorem 17 with $\gamma = 1$ and thus get, with probability at least $1-\delta$,

$$R \le R_{emp}^{1} + \frac{\kappa^2}{\lambda m} + \left(1 + \frac{2\kappa^2}{\lambda}\right)\sqrt{\frac{\ln(1/\delta)}{2m}},$$

where $R_{emp}^{1}$ is the clipped error. It can be seen that

$$R_{emp}^{1} \le \frac{1}{m}\sum_{i=1}^{m}\ell(f, z_i) \le \frac{1}{m}\sum_{i=1}^{m}\alpha_i,$$

where the $\alpha_i$ are the Lagrange multipliers that appear in the dual formulation of the soft margin SVM.

Note that the same remark as in the previous example holds here: there is no bias $b$ in the definition of the SVM.
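Two ingredients of this example are easy to check numerically: the clipped loss $\ell_1$ is dominated pointwise by the hinge loss, and the bound itself is a simple closed-form expression of $(\lambda, \kappa^2, m, \delta)$. All numbers below are hypothetical stand-ins:

```python
import numpy as np

rng = np.random.default_rng(4)
scores = rng.uniform(-3, 3, 10000)       # stand-in values of y * f(x)
hinge = np.maximum(1 - scores, 0.0)      # soft margin SVM loss, 1-admissible
clipped = np.minimum(hinge, 1.0)         # loss ell_gamma with gamma = 1

# The clipped error is dominated by the hinge loss, so the empirical term
# in the bound can itself be bounded by the SVM objective value.
assert np.all(clipped <= hinge)

def svm_bound(r_emp1, kappa2, lam, m, delta):
    # Right-hand side of the Example 2 bound (hypothetical inputs below).
    return (r_emp1 + kappa2 / (lam * m)
            + (1 + 2 * kappa2 / lam) * np.sqrt(np.log(1 / delta) / (2 * m)))

print(svm_bound(0.2, 1.0, 0.1, 50000, 0.01))
```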
Example 3 (Stability of regularized least squares regression) Again, we consider the bounded case $Y = [0, B]$. The regularized least squares regression algorithm is defined by

$$A_S = \arg\min_{g\in\mathcal{F}}\ \frac{1}{m}\sum_{i=1}^{m}\ell(g, z_i) + \lambda\|g\|_k^2,$$

where $\ell(f, z) = (f(x) - y)^2$. We can take $B(y) = (y + B)^2$, so that $\ell$ is $2B$-admissible by Lemma 23. Also, we have

$$\forall z \in Z,\quad 0 \le \ell(A_S, z) \le B^2\left(\frac{\kappa}{\sqrt{\lambda}} + 1\right)^2.$$

The stability bound for this algorithm is thus

$$\beta \le \frac{2\kappa^2 B^2}{\lambda m},$$

so that we obtain the generalization error bound, with probability at least $1-\delta$,

$$R \le R_{emp} + \frac{4\kappa^2 B^2}{\lambda m} + \left(\frac{8\kappa^2 B^2}{\lambda} + B^2\left(\frac{\kappa}{\sqrt{\lambda}} + 1\right)^2\right)\sqrt{\frac{\ln(1/\delta)}{2m}}.$$
5.2.3 Regularization by the Relative Entropy
In this section, we consider algorithms that build a mixture, or a weighted combination, of base hypotheses.

Let us consider a set $\mathcal{H}$ of functions $h : X \to Y$ parameterized by some parameter $\theta$:

$$\mathcal{H} = \{h_\theta : \theta \in \Theta\}.$$

This set is the base class from which the learning algorithm will form mixtures by averaging the predictions of base hypotheses. More precisely, we assume that $\Theta$ is a measurable space on which a reference measure is defined. The output of our algorithm is a mixture of elements of $\Theta$; in other words, it is a probability distribution over $\Theta$. We will thus choose $\mathcal{F}$ as the set of all such probability distributions (dominated by the reference measure), defined by their density with respect to the reference measure.
Once an element $f \in \mathcal{F}$ is chosen by the algorithm, the predictions are computed as follows:

$$\hat y(x) = \int_\Theta h_\theta(x)\,f(\theta)\,d\theta,$$

which means that the prediction produced by the algorithm is indeed a weighted combination of the predictions of the base hypotheses, weighted by the density $f$. In Bayesian terms, $A_S$ would be a posterior on $\Theta$ computed from the observation of $S$, and $\hat y(x)$ is the corresponding Bayes prediction. By some abuse of notation, we will denote by $A_S$ both the element $f \in \mathcal{F}$ that is used by the algorithm to weight the base hypotheses (which can be considered as a function $\Theta \to \mathbb{R}$) and the prediction function $x \in X \mapsto \hat y(x)$.
Now we need to define a loss function on $\mathcal{F}\times Z$. This can be done by extending a loss function $r$ defined on $\mathcal{H}\times Z$, with associated cost function $s$, i.e. $r(h, z) = s(h(x), y)$. There are two ways of deriving a loss function on $\mathcal{F}$. We can simply use $s$ to compute the discrepancy between the predicted and true labels,

$$\ell(g, z) = s(\hat y(x), y), \qquad (27)$$

or we can average the loss over $\Theta$,

$$\ell(g, z) = \int_\Theta r(h_\theta, z)\,g(\theta)\,d\theta. \qquad (28)$$

The first loss is the one used when one is doing Bayesian averaging of hypotheses. The second loss corresponds to the expected loss of a randomized algorithm that would sample $h \in \mathcal{H}$ according to the posterior $A_S$ to perform the predictions.

In the remainder, we will focus on the second type of loss, since it is easier to analyze. Note, however, that this loss will be used only to define a regularization algorithm, and that the loss that is used to measure its error may be different.
Our goal is to choose the posterior $f$ via the minimization of a regularized objective function. We choose some fixed density $f_0$ and define the regularizer as

$$N(g) = K(g, f_0) = \int_\Theta g\,\ln\frac{g}{f_0}\,d\theta,$$

$K$ being the Kullback-Leibler divergence, or relative entropy. In Bayesian terms, $f_0$ would be our prior. Now the goal is to minimize the following objective function:

$$R_r(g) = \frac{1}{m}\sum_{i=1}^{m}\ell(g, z_i) + \lambda K(g, f_0),$$

where $\ell$ is given by (28). We can interpret the minimization of this objective function as the computation of a Maximum A Posteriori (MAP) estimate.
Let us analyze this algorithm. We will assume that we know a bound $M$ on the loss $r(h_\theta, z)$. First notice that $\ell$ is linear in $g$, and is thus convex and $M$-Lipschitz with respect to the $L^1$ norm:

$$|\ell(g, z) - \ell(g', z)| \le M\int_\Theta |g - g'|\,d\theta.$$

Thus $\ell$ is $M$-admissible with respect to $\mathcal{F}$.
We can now state the following result on the uniform stability of the algorithm defined above.

Theorem 24 Let $\mathcal{F}$ be defined as above, and let $r$ be any loss function defined on $\mathcal{H}\times Z$, bounded by $M$. Let $f_0$ be a fixed member of $\mathcal{F}$. When $\ell$ is defined by (28), the learning algorithm $A$ defined by

$$A_S = \arg\min_{g\in\mathcal{F}}\ \frac{1}{m}\sum_{i=1}^{m}\ell(g, z_i) + \lambda K(g, f_0) \qquad (29)$$

has uniform stability $\beta$ with respect to $\ell$, with

$$\beta \le \frac{M^2}{\lambda m}.$$
Proof Recall the following property of the relative entropy (see e.g. Cover and Thomas, 1991): for any $g, g'$,

$$\frac{1}{2}\left(\int_\Theta |g - g'|\,d\theta\right)^2 \le K(g, g').$$

Moreover, the Bregman divergence associated to the relative entropy to $f_0$ is

$$d_{K(\cdot, f_0)}(g, g') = K(g, g').$$

We saw that $\ell$ is $M$-admissible; thus, by Lemma 21, we get

$$\lambda\left(\int_\Theta |f - f^{\setminus i}|\,d\theta\right)^2 \le \frac{M}{m}\int_\Theta |f - f^{\setminus i}|\,d\theta,$$

hence

$$\int_\Theta |f - f^{\setminus i}|\,d\theta \le \frac{M}{\lambda m},$$

and thus, using again the $M$-admissibility of $\ell$, we get, for all $z \in Z$,

$$|\ell(f, z) - \ell(f^{\setminus i}, z)| \le \frac{M^2}{\lambda m},$$

which concludes the proof.
Now let us consider the case of classification, where $Y = \{-1, 1\}$. If we use base hypotheses $h_\theta$ that return values in $\{-1, 1\}$, it is easy to see from the proof of the above theorem that algorithm $A$ has classification stability $\frac{M}{\lambda m}$. Indeed, we have

$$|A_S(x) - A_{S^{\setminus i}}(x)| = \left|\int_\Theta h_\theta(x)\left(A_S - A_{S^{\setminus i}}\right)d\theta\right| \le \int_\Theta \left|A_S - A_{S^{\setminus i}}\right|\,d\theta \le \frac{M}{\lambda m},$$

where the last inequality is derived in the proof of Theorem 24.
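For a finite parameter set $\Theta$, the minimizer of (29) has the familiar closed form of a Gibbs posterior, $g \propto f_0\,e^{-\bar r_\theta/\lambda}$ with $\bar r_\theta$ the average loss, which makes the $L^1$-stability statement in the proof easy to check numerically. The base-class losses below are random stand-ins for a concrete base class:

```python
import numpy as np

rng = np.random.default_rng(5)
n_theta, m, lam, M = 50, 40, 0.2, 1.0

# Hypothetical losses r(h_theta, z_i) in [0, M] over a finite parameter set.
r = rng.uniform(0, M, size=(n_theta, m))
f0 = np.full(n_theta, 1.0 / n_theta)  # uniform prior

def posterior(cols):
    # Minimizer of (1/m) sum_{i in cols} <g, r[:, i]> + lam*K(g, f0):
    # the Gibbs density g proportional to f0 * exp(-avg loss / lam).
    L = r[:, cols].sum(axis=1) / m  # keep the 1/m factor, as in (20)
    w = f0 * np.exp(-L / lam)
    return w / w.sum()

f = posterior(list(range(m)))
worst_l1 = max(np.abs(f - posterior([j for j in range(m) if j != i])).sum()
               for i in range(m))
print(worst_l1, M / (lam * m))  # Theorem 24: L1 change at most M/(lam*m)
```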
Example 4 (Maximum entropy discrimination) Jaakkola et al. (1999) introduce the Minimum Relative Entropy (MRE) algorithm, which is a real-valued classifier obtained by minimizing

$$R_r(g) = \frac{1}{m}\sum_{i=1}^{m}\ell(g, z_i) + \lambda K(g, f_0),$$

where the base class has two parameters, $\mathcal{H} = \{h_{\theta,\gamma} : \theta \in \Theta,\ \gamma \in \mathbb{R}\}$ with $h_{\theta,\gamma} = h_\theta$, and the loss is defined by

$$\ell(g, z) = \left(\int_{\Theta\times\mathbb{R}} \big(\gamma - y\,h_\theta(x)\big)\,g(\theta,\gamma)\,d\theta\,d\gamma\right)_+.$$

If we have a bound $B$ on the quantity $\gamma - y\,h_\theta(x)$, we see that this loss function is $B$-admissible, and thus, by Theorem 24 and the remark about classification stability, we deduce that the MRE algorithm has classification stability bounded by

$$\frac{B}{\lambda m}.$$
6. Discussion
For regularization algorithms, we obtained bounds on the uniform stability of the order of $\beta = O\!\left(\frac{1}{\lambda m}\right)$. Plugging this result into our main theorem, we obtained bounds on the generalization error of the following type:

$$R \le R_{emp} + O\!\left(\frac{1}{\lambda\sqrt{m}}\right),$$

so that we obtain nontrivial results only if we can guarantee that $\lambda \gg \frac{1}{\sqrt{m}}$. This is likely to depend on the noise in the data, and no theoretical results exist that guarantee that $\lambda$ does not decrease too fast when $m$ is increased.

However, it should be possible to refine our results, which sometimes used quite crude bounds. It seems reasonable that a bound like

$$R \le R_{emp} + O\!\left(\frac{1}{\sqrt{\lambda m}}\right)$$

could be obtained. This remains an open problem.
In order to better understand the distinctive feature of our bounds, we can compare them to bounds from Structural Risk Minimization (SRM), for example on the SVM algorithm. The SVM algorithm can be presented using two equivalent formulations:

$$\min_{f\in\mathcal{F}}\ \frac{1}{m}\sum_{i=1}^{m}\big(1 - y_i f(x_i)\big)_+ + \lambda\|f\|_k^2$$

or

$$\min_{f\in\mathcal{F}}\ \frac{1}{m}\sum_{i=1}^{m}\big(1 - y_i f(x_i)\big)_+ \quad\text{with}\quad \|f\|_k^2 \le c.$$

The equivalence of these two problems comes from the fact that, for any $\lambda$, there exists a $c$ such that the solutions of the two problems are the same.

The SRM principle consists in solving the second problem for several values of $c$ and then choosing the value that minimizes a bound depending on the VC-dimension of the set $\{f : \|f\|_k^2 \le c\}$. However, this quantity is usually not easy to compute, and only loose upper bounds can be found. Moreover, since minimization under a constraint on the norm is not easy to perform, one typically performs the first minimization, for a particular value of $\lambda$ chosen by cross-validation, and then uses SRM bounds with $c = \|f\|_k^2$. This requires the SRM bounds to hold uniformly for all values of $c$.

This approach has led to bounds which were quite predictive of the behavior but quantitatively very loose. In contrast, our approach directly focuses on the actual minimization that is performed (the first one) and does not require the computation of a complexity measure. Indeed, the complexity is implicitly evaluated by the actual parameter $\lambda$.
7. Conclusion
We explored the possibility of obtaining generalization bounds for specific algorithms from stability properties. We introduced several notions of stability and obtained corresponding generalization bounds using either the empirical error or the leave-one-out error. Our main result is an exponential bound for algorithms that have good uniform stability. We then proved that regularization algorithms have such a property and that their stability is controlled by the regularization parameter $\lambda$. This allowed us to obtain bounds on the generalization error of Support Vector Machines, both in the classification and in the regression framework, which do not depend on the implicit VC-dimension but rather depend explicitly on the tradeoff parameter $C$.

Further directions of research include the question of obtaining better bounds via uniform stability, and the use of less restrictive notions of stability. Of great practical interest would be to design algorithms that maximize their own stability.
Acknowledgements
The authors wish to thank Ryan Rifkin and Ralf Herbrich for fruitful comments that helped improve the readability, and Alex Smola, Gábor Lugosi, Stéphane Boucheron and Sayan Mukherjee for stimulating discussions.
Appendix A. Proof of Lemma 9
Let us start with a generalized version of a lemma from Rogers and Wagner (1978).
Lemma 25 For any learning algorithm $A$ and any $i, j \in \{1,\ldots,m\}$ such that $i \ne j$, we have

$$\mathbb{E}_S\!\left[(R - R_{emp})^2\right] \le \mathbb{E}_{S,z,z'}\!\left[\ell(A_S,z)\,\ell(A_S,z')\right] - 2\,\mathbb{E}_{S,z}\!\left[\ell(A_S,z)\,\ell(A_S,z_i)\right] + \mathbb{E}_S\!\left[\ell(A_S,z_i)\,\ell(A_S,z_j)\right] + \frac{M}{m}\,\mathbb{E}_S\!\left[\ell(A_S,z_i)\right] - \frac{1}{m}\,\mathbb{E}_S\!\left[\ell(A_S,z_i)\,\ell(A_S,z_j)\right],$$

and

$$\mathbb{E}_S\!\left[(R - R_{loo})^2\right] \le \mathbb{E}_{S,z,z'}\!\left[\ell(A_S,z)\,\ell(A_S,z')\right] - 2\,\mathbb{E}_{S,z}\!\left[\ell(A_S,z)\,\ell(A_{S^{\setminus i}},z_i)\right] + \mathbb{E}_S\!\left[\ell(A_{S^{\setminus i}},z_i)\,\ell(A_{S^{\setminus j}},z_j)\right] + \frac{M}{m}\,\mathbb{E}_S\!\left[R_{loo}\right] - \frac{1}{m}\,\mathbb{E}_S\!\left[\ell(A_{S^{\setminus i}},z_i)\,\ell(A_{S^{\setminus j}},z_j)\right].$$
Proof We have

$$\mathbb{E}_S[R^2] = \mathbb{E}_S\!\left[\mathbb{E}_z[\ell(A_S,z)]^2\right] = \mathbb{E}_S\!\left[\mathbb{E}_z[\ell(A_S,z)]\,\mathbb{E}_{z'}[\ell(A_S,z')]\right] = \mathbb{E}_{S,z,z'}\!\left[\ell(A_S,z)\,\ell(A_S,z')\right],$$

and also

$$\mathbb{E}_S[R\,R_{emp}] = \mathbb{E}_S\!\left[R\cdot\frac{1}{m}\sum_{i=1}^{m}\ell(A_S,z_i)\right] = \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{S,z}\!\left[\ell(A_S,z)\,\ell(A_S,z_i)\right] = \mathbb{E}_{S,z}\!\left[\ell(A_S,z)\,\ell(A_S,z_i)\right],$$

for any fixed $i$, by symmetry. Similarly,

$$\mathbb{E}_S[R\,R_{loo}] = \frac{1}{m}\sum_{i=1}^{m}\mathbb{E}_{S,z}\!\left[\ell(A_S,z)\,\ell(A_{S^{\setminus i}},z_i)\right] = \mathbb{E}_{S,z}\!\left[\ell(A_S,z)\,\ell(A_{S^{\setminus i}},z_i)\right].$$

Also, we have

$$\mathbb{E}_S[R_{emp}^2] = \frac{1}{m^2}\sum_{i=1}^{m}\mathbb{E}_S\!\left[\ell(A_S,z_i)^2\right] + \frac{1}{m^2}\sum_{i\ne j}\mathbb{E}_S\!\left[\ell(A_S,z_i)\,\ell(A_S,z_j)\right] \le \frac{M}{m}\,\mathbb{E}_S\!\left[\ell(A_S,z_i)\right] + \frac{m-1}{m}\,\mathbb{E}_S\!\left[\ell(A_S,z_i)\,\ell(A_S,z_j)\right],$$

and

$$\mathbb{E}_S[R_{loo}^2] = \frac{1}{m^2}\sum_{i=1}^{m}\mathbb{E}_S\!\left[\ell(A_{S^{\setminus i}},z_i)^2\right] + \frac{1}{m^2}\sum_{i\ne j}\mathbb{E}_S\!\left[\ell(A_{S^{\setminus i}},z_i)\,\ell(A_{S^{\setminus j}},z_j)\right] \le \frac{M}{m}\,\mathbb{E}_S[R_{loo}] + \frac{m-1}{m}\,\mathbb{E}_S\!\left[\ell(A_{S^{\setminus i}},z_i)\,\ell(A_{S^{\setminus j}},z_j)\right].$$

Expanding $\mathbb{E}[(R - R_{emp})^2] = \mathbb{E}[R^2] - 2\mathbb{E}[R\,R_{emp}] + \mathbb{E}[R_{emp}^2]$ (and similarly for the leave-one-out error) concludes the proof.
Now let us prove Lemma 9. We will use several times the fact that the random variables are i.i.d., so that we can interchange them without modifying the expectation (it is just a matter of renaming them). We introduce the notation $T = S^{\setminus i,j}$, and we will denote by $A_{T\cup\{z,z'\}}$ the result of training on the set $T \cup \{z, z'\}$.

Let us first formulate the first inequality of Lemma 25 as

$$\mathbb{E}_S\!\left[(R - R_{emp})^2\right] \le \frac{1}{m}\,\mathbb{E}_S\!\left[\ell(A_S,z_i)\big(M - \ell(A_S,z_j)\big)\right] + \mathbb{E}_{S,z,z'}\!\left[\ell(A_S,z)\,\ell(A_S,z') - \ell(A_S,z)\,\ell(A_S,z_i)\right] + \mathbb{E}_{S,z,z'}\!\left[\ell(A_S,z_i)\,\ell(A_S,z_j) - \ell(A_S,z)\,\ell(A_S,z_i)\right] =: I_1 + I_2 + I_3.$$
Using Schwarz's inequality, we have

$$\mathbb{E}_S\!\left[\ell(A_S,z_i)\big(M - \ell(A_S,z_j)\big)\right]^2 \le \mathbb{E}_S\!\left[\ell(A_S,z_i)^2\right]\mathbb{E}_S\!\left[\big(M - \ell(A_S,z_j)\big)^2\right] \le M^2\,\mathbb{E}_S\!\left[\ell(A_S,z_i)\right]\mathbb{E}_S\!\left[M - \ell(A_S,z_j)\right] = M^2\,\mathbb{E}_S\!\left[\ell(A_S,z_i)\right]\left(M - \mathbb{E}_S\!\left[\ell(A_S,z_i)\right]\right) \le \frac{M^4}{4},$$

so that we conclude

$$I_1 \le \frac{M^2}{2m}.$$
Now we rewrite $I_2$. Using the i.i.d. assumption, we may rename $z_i$ as $z'$ in the second term, which gives

$$I_2 = \mathbb{E}_{S,z,z'}\!\left[\ell(A_{T\cup\{z_i,z_j\}},z)\,\ell(A_{T\cup\{z_i,z_j\}},z')\right] - \mathbb{E}_{S,z,z'}\!\left[\ell(A_{T\cup\{z',z_j\}},z)\,\ell(A_{T\cup\{z',z_j\}},z')\right].$$

Next we rewrite $I_3$, renaming $z_i$ as $z$ and $z_j$ as $z'$ in the first term:

$$I_3 = \mathbb{E}_{S,z,z'}\!\left[\ell(A_{T\cup\{z,z'\}},z)\,\ell(A_{T\cup\{z,z'\}},z')\right] - \mathbb{E}_{S,z,z'}\!\left[\ell(A_{T\cup\{z_i,z_j\}},z)\,\ell(A_{T\cup\{z_i,z_j\}},z_i)\right].$$

In the sum $I_2 + I_3$, each difference of products can be decomposed into a telescoping sum of terms in which the two training sets (or the two evaluation points) differ in exactly one example; since $0 \le \ell \le M$, each such term is bounded by $M$ times an expectation of the form $\mathbb{E}\big[|\ell(A_{T\cup\{u,v\}},w) - \ell(A_{T\cup\{u',v\}},w)|\big]$, and the i.i.d. assumption allows us to rename the variables so that the resulting terms all reduce to the same quantity. Performing this decomposition yields

$$I_2 + I_3 \le 3M\,\mathbb{E}_{S,z}\!\left[\left|\ell(A_{T\cup\{z_i,z_j\}},z) - \ell(A_{T\cup\{z,z_j\}},z)\right|\right] = 3M\,\mathbb{E}_{S,z_i'}\!\left[\left|\ell(A_S,z_i) - \ell(A_{S^i},z_i)\right|\right],$$

where $S^i$ denotes the training set $S$ with $z_i$ replaced by an independent copy $z_i'$. Which proves the first part of the bound.
For the second part, we use the same technique and slightly vary the algebra. We rewrite $I_2$, this time renaming $z_i$ as $z$ and $z$ as $z'$ in the second term:

$$I_2 = \mathbb{E}_{S,z,z'}\!\left[\ell(A_{T\cup\{z_i,z_j\}},z)\,\ell(A_{T\cup\{z_i,z_j\}},z')\right] - \mathbb{E}_{S,z,z'}\!\left[\ell(A_{T\cup\{z,z_j\}},z')\,\ell(A_{T\cup\{z,z_j\}},z)\right],$$

and we rewrite $I_3$, renaming $z_j$ as $z$ in the first term:

$$I_3 = \mathbb{E}_{S,z,z'}\!\left[\ell(A_{T\cup\{z_i,z\}},z_i)\,\ell(A_{T\cup\{z_i,z\}},z)\right] - \mathbb{E}_{S,z,z'}\!\left[\ell(A_{T\cup\{z_i,z_j\}},z)\,\ell(A_{T\cup\{z_i,z_j\}},z_i)\right].$$

Decomposing the sum $I_2 + I_3$ as before into single-point changes and renaming variables (in particular, exchanging $z_i$ and $z_j$ where needed), we obtain

$$I_2 + I_3 \le M\,\mathbb{E}_{S,z,z_i'}\!\left[\left|\ell(A_S,z) - \ell(A_{S^i},z)\right|\right] + M\,\mathbb{E}_{S,z_i'}\!\left[\left|\ell(A_S,z_j) - \ell(A_{S^i},z_j)\right|\right] + M\,\mathbb{E}_{S,z_i'}\!\left[\left|\ell(A_S,z_i) - \ell(A_{S^i},z_i)\right|\right].$$

The above concludes the proof of the bound for the empirical error.
We now turn to the leave-one-out error. The bound can be obtained in a similar way. Actually, we notice that if we rewrite the derivation for the empirical error, we simply have to remove from the training set the point at which the loss is computed; that is, we simply have to replace all the quantities of the form $\ell(A_{T\cup\{z,z'\}}, z)$ by $\ell(A_{T\cup\{z'\}}, z)$. It is easy to see that the above results are modified in a way that gives the correct bound for the leave-one-out error.
Appendix B. Proof of Theorem 18
First, we rewrite Inequality (15) of Theorem 17 as

$$\mathbb{P}_S\!\left[R - R_{emp}^{\gamma} \ge \frac{2\beta}{\gamma} + \left(\frac{4m\beta}{\gamma} + 1\right)\sqrt{\frac{1}{2m}}\,\epsilon\right] \le e^{-\epsilon^2}.$$

We introduce the quantity

$$u(\gamma, \epsilon) = \frac{2\beta}{\gamma} + \left(\frac{4m\beta}{\gamma} + 1\right)\sqrt{\frac{1}{2m}}\,\epsilon,$$

and rewrite the above bound as

$$\mathbb{P}_S\!\left[R - R_{emp}^{\gamma} \ge u(\gamma, \epsilon)\right] \le e^{-\epsilon^2}.$$
We define a sequence $(\gamma_k)_{k\ge 0}$ of real numbers such that

$$\gamma_k = B\,e^{-k},$$

and we define $\epsilon_k = t + \sqrt{2\ln k}$. Now we use the union bound to get a statement that holds for all values in the sequence $(\gamma_k)_{k\ge 1}$:

$$\mathbb{P}_S\!\left[\exists k \ge 1 : R - R_{emp}^{\gamma_k} \ge u(\gamma_k, \epsilon_k)\right] \le \sum_{k\ge 1}\mathbb{P}_S\!\left[R - R_{emp}^{\gamma_k} \ge u(\gamma_k, \epsilon_k)\right] \le \sum_{k\ge 1} e^{-\epsilon_k^2} \le \sum_{k\ge 1}\frac{1}{k^2}\,e^{-t^2} \le 2e^{-t^2}.$$

For a given $\gamma \in (0, B]$, consider the unique value $k \ge 1$ such that $\gamma_k \le \gamma < \gamma_{k-1}$. We thus have $\gamma_k \ge \gamma/e$. The following inequalities follow from the definition of $\gamma_k$:

$$\frac{1}{\gamma_k} \le \frac{e}{\gamma},\qquad R_{emp}^{\gamma_k} \le R_{emp}^{\gamma},\qquad \sqrt{2\ln k} = \sqrt{2\ln\ln\frac{B}{\gamma_k}} \le \sqrt{2\ln\ln\frac{eB}{\gamma}},$$

so that we have

$$u\!\left(\gamma_k,\ t + \sqrt{2\ln k}\right) \le \frac{2e\beta}{\gamma} + \left(\frac{4me\beta}{\gamma} + 1\right)\sqrt{\frac{1}{2m}}\left(t + \sqrt{2\ln\ln\frac{eB}{\gamma}}\right) =: v(\gamma, t).$$

We thus get the following implication:

$$R - R_{emp}^{\gamma} \ge v(\gamma, t) \implies R - R_{emp}^{\gamma_k} \ge u\!\left(\gamma_k,\ t + \sqrt{2\ln k}\right).$$
This reasoning thus proves that

$$\mathbb{P}_S\!\left[\exists \gamma \in (0,B] : R - R_{emp}^{\gamma} \ge v(\gamma, t)\right] \le \mathbb{P}_S\!\left[\exists k \ge 1 : R - R_{emp}^{\gamma_k} \ge u\!\left(\gamma_k,\ t + \sqrt{2\ln k}\right)\right],$$

and thus

$$\mathbb{P}_S\!\left[\exists \gamma \in (0,B] : R - R_{emp}^{\gamma} \ge v(\gamma, t)\right] \le 2e^{-t^2},$$

which can be written as

$$\mathbb{P}_S\!\left[\exists \gamma \in (0,B] : R - R_{emp}^{\gamma} \ge \frac{2e\beta}{\gamma} + \left(t + \sqrt{2\ln\ln\frac{eB}{\gamma}}\right)\left(\frac{4me\beta}{\gamma} + 1\right)\sqrt{\frac{1}{2m}}\right] \le 2e^{-t^2},$$

and gives, setting $\delta = 2e^{-t^2}$ (that is, $t = \sqrt{\ln(2/\delta)}$), with probability at least $1-\delta$,

$$\forall \gamma \in (0,B],\quad R \le R_{emp}^{\gamma} + \frac{2e\beta}{\gamma} + \left(\sqrt{\ln(2/\delta)} + \sqrt{2\ln\ln\frac{eB}{\gamma}}\right)\left(\frac{4me\beta}{\gamma} + 1\right)\sqrt{\frac{1}{2m}},$$

which gives the first inequality. The second inequality can be proven in the same way.
Appendix C. Convexity
For more details, see Gordon (1999) or Rockafellar (1970). A convex function $F$ is any function from a vector space $\mathcal{F}$ to $\mathbb{R}\cup\{-\infty, +\infty\}$ which satisfies

$$F\big(\lambda g + (1-\lambda) g'\big) \le \lambda F(g) + (1-\lambda) F(g'),$$

for all $g, g' \in \mathcal{F}$ and $\lambda \in [0,1]$. A proper convex function is one that is always greater than $-\infty$ and not uniformly $+\infty$. The domain of $F$ is the set of points where $F$ is finite. A convex function is closed if its epigraph $\{(f, y) : y \ge F(f)\}$ is closed. The subgradient of a convex function at a point $g$, written $\partial F(g)$, is the set of vectors $a$ such that

$$F(g') \ge F(g) + \langle g' - g, a\rangle,$$

for all $g'$.

Convex functions are continuous on the interior of their domain, and differentiable on the interior of their domain except on a set of measure zero. For a convex function $F$, we define the dual $F^*$ of $F$ by

$$F^*(a) = \sup_{g}\ \langle a, g\rangle - F(g).$$
Denoting by $\nabla F(g')$ a subgradient of $F$ at $g'$ (i.e. a member of $\partial F(g')$), we can define the Bregman divergence associated to $F$, of $g$ to $g'$, by

$$d_F(g, g') = F(g) - F(g') - \langle g - g', \nabla F(g')\rangle.$$

When $F$ is everywhere differentiable, this is well defined (since the subgradient is unique) and nonnegative (by the definition of the subgradient). Otherwise, we can define the generalized divergence as

$$d_F(g, a) = F(g) + F^*(a) - \langle g, a\rangle,$$

where $a \in \operatorname{dom} F^*$. Notice that this divergence is also nonnegative. Moreover, the fact that $f$ is a minimum of $F$ in $\mathcal{F}$ is equivalent to

$$0 \in \partial F(f),$$

which, with the following relationship,

$$a \in \partial F(g) \iff F(g) + F^*(a) = \langle g, a\rangle,$$

gives

$$F(f) + F^*(0) = 0$$

when $f$ is a minimum of $F$ in $\mathcal{F}$. When $F$ is everywhere differentiable, it is easy to get

$$\forall g \in \mathcal{F},\quad d_F(g, f) = F(g) - F(f); \qquad (30)$$

otherwise, using generalized divergences, we have

$$\forall g \in \mathcal{F},\quad d_F(g, 0) = F(g) - F(f). \qquad (31)$$
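These definitions can be illustrated numerically in finite dimension: for $F(g) = \|g\|^2$ the Bregman divergence is the squared distance, and (30) holds with equality at a minimizer. The sketch below (a toy illustration, not from the paper) checks both facts:

```python
import numpy as np

rng = np.random.default_rng(6)

def bregman(F, gradF, g, gp):
    # d_F(g, g') = F(g) - F(g') - <g - g', grad F(g')>.
    return F(g) - F(gp) - (g - gp) @ gradF(gp)

# Example: F(g) = ||g||^2 has d_F(g, g') = ||g - g'||^2 ...
F = lambda g: g @ g
gradF = lambda g: 2 * g
g, gp = rng.standard_normal((2, 5))
assert np.isclose(bregman(F, gradF, g, gp), (g - gp) @ (g - gp))

# ... and at a minimizer f of F (here f = 0, where grad F vanishes),
# equation (30) holds with equality: d_F(g, f) = F(g) - F(f).
f = np.zeros(5)
assert np.isclose(bregman(F, gradF, g, f), F(g) - F(f))
```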
References
N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997.

P. Bartlett. For valid generalization, the size of the weights is more important than the size of the network. In Advances in Neural Information Processing Systems, 1996.

J.F. Bonnans and A. Shapiro. Optimization problems with perturbation: a guided tour. Technical Report 2872, INRIA, April 1996.

O. Bousquet and A. Elisseeff. Algorithmic stability and generalization performance. In Neural Information Processing Systems 14, 2001.

L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996a.

L. Breiman. Heuristics of instability and stabilization in model selection. Annals of Statistics, 24(6):2350-2383, 1996b.

T.M. Cover and J.A. Thomas. Elements of Information Theory. John Wiley, 1991.

L. Devroye. Exponential inequalities in nonparametric estimation. In Nonparametric Functional Estimation and Related Topics, pages 31-44. Kluwer Academic Publishers, 1991.

L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer Verlag, 1996.

L. Devroye and T. Wagner. Distribution-free inequalities for the deleted and holdout error estimates. IEEE Transactions on Information Theory, 25(2):202-207, 1979a.

L. Devroye and T. Wagner. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601-604, 1979b.

T. Evgeniou, M. Pontil, and T. Poggio. A unified framework for regularization networks and support vector machines. A.I. Memo 1654, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, December 1999.

G. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.

T. Jaakkola, M. Meila, and T. Jebara. Maximum entropy discrimination. In Neural Information Processing Systems 12, 1999.

M. Kearns and D. Ron. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427-1453, 1999.

G. Lugosi and M. Pawlak. On the posterior-probability estimate of the error of nonparametric classification rules. IEEE Transactions on Information Theory, 40(2):475-481, 1994.

C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148-188. Cambridge University Press, Cambridge, 1989.

T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978-982, 1990.

R.T. Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

W. Rogers and T. Wagner. A finite sample distribution-free performance bound for local discrimination rules. Annals of Statistics, 6(3):506-514, 1978.

J.M. Steele. An Efron-Stein inequality for nonsymmetric statistics. Annals of Statistics, 14:753-758, 1986.

M. Talagrand. A new look at independence. Annals of Probability, 24:1-34, 1996.

V.N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

G. Wahba. An introduction to model building with reproducing kernel Hilbert spaces. Technical Report, Statistics Department TR 1020, University of Wisconsin, Madison, 2000.