This article is a summary note for CS3244 Week 8. The lecture notes of this course use Caltech's notes as a reference.
1. Regularization
Regularization constrains the model so that it cannot fit the noise.
Let's consider the polynomial model $H_{Q}$: polynomials of order $Q$.
Constraining the weights
- Hard constraint: $H_{2}$ is a constrained version of $H_{10}$, with $w_{q}=0$ for $q > 2$
- Soft-order constraint: $\sum\limits_{q=0}^{Q} w_{q}^{2} \leq C$
- We still minimize $E_{in}(w) = \frac{1}{N} (Zw-y)^{T}(Zw-y)$
- But we cannot choose $w$ freely: it is subject to $w^{T}w \leq C$
- The solution is now called $w_{reg}$ instead of $w_{lin}$.
Solving for $w_{reg}$
- Since $w$ must lie within the red circle ($w^{T}w \leq C$) while the unconstrained minimum lies outside it, the intuitive choice is a point on the boundary, as in the first figure in the picture above.
- The gradient of the $E_{in}$ ellipse at $w$ is $\triangledown E_{in}$, which is orthogonal to the tangent of the ellipse.
- The normal to the constraint circle at $w$ points exactly in the direction of the vector $w$.
- Such a $w$ is not the minimum if $-\triangledown E_{in}$ has a component along the tangent: moving along the boundary would still decrease $E_{in}$.
- $w$ achieves the minimum when the normal (the direction of $w$) and $\triangledown E_{in}$ point in exactly opposite directions:
- $\triangledown E_{in}(w_{reg}) \propto -w_{reg}$, say $\triangledown E_{in}(w_{reg}) = -2\frac{\lambda}{N}w_{reg}$ (constant of proportionality chosen for convenience)
- So $\triangledown E_{in}(w_{reg}) + 2\frac{\lambda}{N}w_{reg} = 0$, which is exactly the condition for solving the unconstrained problem $\min \, E_{in}(w) + \frac{\lambda}{N}w^{T}w$
Augmented error
\begin{align} \min E_{aug}(w) &= E_{in}(w) + \frac{\lambda}{N} w^{T}w \\ &= \frac{1}{N} (Zw-y)^{T}(Zw-y) + \frac{\lambda}{N} w^{T}{w} \\ &= \frac{1}{N} [(Zw-y)^{T}(Zw-y) + \lambda w^{T}{w}] \end{align}
This corresponds to $\min E_{in}(w)$ subject to $w^{T}w \leq C$ (the VC formulation).
The solution to (3) can be obtained by taking derivatives.
- $\triangledown E_{aug}(w) = 0 \implies Z^{T}(Zw-y)+\lambda w = 0$
- $w_{reg} = (Z^{T}Z + \lambda I)^{-1}Z^{T}y$ (with regularization)
- as opposed to $w_{lin} = (Z^{T}Z)^{-1}Z^{T}y$ (without regularization)
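As a minimal sketch of this closed-form solution (assuming `Z` is the matrix of transformed inputs; all names here are illustrative):

```python
import numpy as np

def ridge_solution(Z, y, lam):
    """Closed-form regularized least squares: w_reg = (Z^T Z + lam*I)^{-1} Z^T y."""
    d = Z.shape[1]
    # Solve the linear system instead of explicitly inverting the matrix
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)

# lam = 0 recovers the unregularized solution w_lin:
# Z = np.random.randn(20, 5); y = np.random.randn(20)
# w_reg = ridge_solution(Z, y, lam=0.1)
```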
The result: even a little regularization goes a long way in reducing overfitting, while too much regularization leads to underfitting.
Weight 'decay'
Minimizing $E_{in}(w) + \frac{\lambda}{N}w^{T}w$ is called weight decay.
- Let's say we use gradient descent: $w(t+1) = w(t) - \eta \triangledown E_{in}(w(t)) - 2\eta\frac{\lambda}{N}w(t)$
$= w(t)\left(1-2\eta\frac{\lambda}{N}\right) - \eta \triangledown E_{in}(w(t))$ (shrink the weight first, then move in the suggested direction; see the sketch after this list)
- The same applies in neural networks: $w^{T}w = \sum\limits_{l=1}^{L}\sum\limits_{i=0}^{d^{(l-1)}}\sum\limits_{j=1}^{d^{(l)}}(w_{ij}^{(l)})^{2}$
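A minimal sketch of the shrink-then-step update; the quadratic $E_{in}$ in the comment and all names are illustrative choices, not from the lecture:

```python
import numpy as np

def weight_decay_step(w, grad_Ein, eta, lam, N):
    """One gradient-descent step with weight decay:
    w(t+1) = (1 - 2*eta*lam/N) * w(t) - eta * grad_Ein(w(t))."""
    return (1 - 2 * eta * lam / N) * w - eta * grad_Ein(w)

# Example with E_in(w) = ||Zw - y||^2 / N, whose gradient is 2 Z^T (Zw - y) / N:
# grad = lambda w: 2 * Z.T @ (Z @ w - y) / len(y)
# w = weight_decay_step(w, grad, eta=0.1, lam=0.5, N=len(y))
```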
Variations of weight decay
- Emphasis of certain weights: $\sum\limits_{q=0}^{Q} \gamma_{q}w_{q}^{2}$
- $\gamma_{q} = 2^{q}$: the regularizer puts huge emphasis on higher-order terms, trying to find as low-order a fit as possible
- $\gamma_{q} = 2^{-q}$: the opposite; it favors a high-order fit
- In neural networks: different layers get different $\gamma$
- Tikhonov regularizer: $w^{T}\Gamma^{T}\Gamma w$
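The Tikhonov form is the general case: $\Gamma = I$ recovers uniform weight decay, and a diagonal $\Gamma$ with $\Gamma_{qq} = \sqrt{\gamma_{q}}$ recovers the per-weight emphasis above. A minimal sketch of the corresponding closed-form solution, with illustrative names:

```python
import numpy as np

def tikhonov_solution(Z, y, lam, Gamma):
    """Minimize (1/N) * [(Zw-y)^T (Zw-y) + lam * w^T Gamma^T Gamma w]."""
    return np.linalg.solve(Z.T @ Z + lam * Gamma.T @ Gamma, Z.T @ y)

# Gamma = np.eye(Q + 1)                            -> plain weight decay
# Gamma = np.diag(np.sqrt(2.0 ** np.arange(Q + 1))) -> emphasis gamma_q = 2^q
```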
Weight growth
We constrain the weights to be large instead - bad!
- In practice λ = 0 turns out to be the optimal case ⇒ this kills the regularization
Practical rule to choose a regularizer
- stochastic noise is 'high-frequency'
- deterministic noise is also non-smooth
- ⇒ constrain learning towards smoother hypotheses
- smaller weights correspond to smoother hypotheses
- The regularizer Ω = Ω(h)
- we minimize $E_{aug}(h) = E_{in}(h) +\frac{\lambda}{N}\Omega(h)$
- We have seen $E_{out}(h) \leq E_{in}(h) + \Omega(N,\mathcal{H},\delta)$ from the VC analysis
- $E_{aug}$ can be better than $E_{in}$ as a proxy for $E_{out}$
- There is a complementary relationship between regularization and the VC bound:
- The VC bound deals with the complexity of the hypothesis set: $\Omega(N, \mathcal{H}, \delta)$.
- Regularization concerns the complexity of the particular hypothesis: $\Omega(h)$.
Choosing a regularizer
The perfect regularizer Ω is one that constrains in the direction of the target function.
- But we do not know f, so we cannot match it; instead, we aim at the noise
- A good regularizer harms overfitting more than it harms fitting:
- it harms fitting the noise more than it harms fitting the signal
Guiding principle:
- move in the direction of smoother or 'simpler' hypotheses (because the noise is not smooth)
- sacrifice a little bias for a large improvement in variance
- What if a bad Ω is chosen? We still have λ: validation can always dial the regularization down, even to λ = 0
Neural-network regularizers
- Weight decay: refer to figure 1 below; the blue line is the activation function of the neurons.
- The blue line is a soft threshold: if the signal is very small, the neuron is almost linear; if the signal is very large, it is almost binary. Small weights keep the signals in the (simpler) near-linear regime.
- Weight elimination: Fewer weights ⇒ smaller VC dimension
- Soft weight elimination: (similar to weight decay) $\Omega(w) = \sum\limits_{i,j,l} \frac{(w_{ij}^{(l)})^{2}}{\beta^{2}+(w_{ij}^{(l)})^{2}}$
- for very small w, β² dominates the denominator, so Ω becomes proportional to $w^{2}$: this is weight decay, and small weights are pushed to 0 (eliminated)
- for very large w, w² dominates, so Ω is close to 1 and its gradient nearly vanishes: large weights are left alone
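A minimal sketch of this penalty and its two regimes (the weights and β below are illustrative):

```python
import numpy as np

def soft_weight_elimination(w, beta):
    """Omega(w) = sum of w_ij^2 / (beta^2 + w_ij^2) over all weights."""
    w2 = w ** 2
    return np.sum(w2 / (beta ** 2 + w2))

w = np.array([0.01, 0.1, 10.0, 100.0])
# Small weights contribute ~ (w/beta)^2 (weight-decay regime),
# large weights each contribute ~ 1 (left alone):
print(soft_weight_elimination(w, beta=1.0))
```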
Early Stopping as a regularizer
- Figure 2 below: stop before you get to the end of training.
- This is regularization through the optimiser, instead of through the objective function
- It is not considered a proper way of regularization
The optimal λ (figure 3)
2. Validation
$E_{out}(h) = E_{in}(h)$ + overfit penalty
- Regularization estimates the 'overfit penalty' component
- Validation estimates the $E_{out}$ component directly
Analyzing the estimate
For an out-of-sample point (x, y), the error is e(h(x), y), with $E[e(h(x),y)] = E_{out}(h)$ and $var[e(h(x),y)] = \sigma^{2}$
- Squared error: $(h(x)-y)^{2}$
- Binary error: [[$h(x)\not =y$]]
On a validation set (which we pretend is an out-of-sample set) $(x_{1},y_{1}),\dots,(x_{K},y_{K})$, the error is $E_{val}(h) = \frac{1}{K} \sum\limits_{k=1}^{K} e(h(x_{k}),y_{k})$
- $E[E_{val}(h)] = \frac{1}{K} \sum\limits_{k=1}^{K} E[e(h(x_{k}),y_{k})] = E_{out}(h)$
- $var[E_{val}(h)] = \frac{1}{K^{2}} \sum\limits_{k=1}^{K} var[e(h(x_{k}),y_{k})] = \frac{\sigma^{2}}{K}$ (the point-wise errors are independent)
- $E_{val}(h) = E_{out}(h) \pm O(\frac{1}{\sqrt{K}})$
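A quick numerical check of $var[E_{val}(h)] = \frac{\sigma^{2}}{K}$; the error distribution below is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 100, 10000
# i.i.d. per-point errors with variance sigma^2 = 1 (distribution is illustrative)
errors = rng.exponential(scale=1.0, size=(trials, K))
E_val = errors.mean(axis=1)            # one validation estimate per trial
print(E_val.var(), 1.0 / K)            # empirical var[E_val] vs sigma^2/K, both ~0.01
```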
If K is larger, the estimate is more reliable; however, those K points come out of the N training points:
- More points K for validation ($D_{val}$) leaves fewer points N-K for training ($D_{train}$)
- Small K ⇒ a bad estimate of $E_{out}$
- Large K ⇒ a large actual $E_{out}$, since the hypothesis is trained on few points (as seen in figure 1 below)
A way to solve this problem is to put the K points back into N (figure 2 below):
- D = $D_{train} \cup D_{val}$
- The hypothesis trained on D: g
- The hypothesis trained on $D_{train}$: $g^{-}$
- Test $g^{-}$ on $D_{val}$ to yield $E_{val}(g^{-})$
- Report g (not $g^{-}$!), with $E_{val}(g^{-})$ as its error estimate
- However, if K is large, $g^{-}$ is very different from g, so $E_{out}(g^{-})$ will be a bad estimate of the real $E_{out}(g)$
- Rule of thumb: $K = \frac{N}{5}$
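A minimal sketch of this split, validate, and restore procedure; `train` and `error` are hypothetical placeholders for a learning algorithm and an error measure:

```python
import numpy as np

def validate_and_restore(X, y, train, error, K):
    """Train g_minus on N-K points, estimate its error on the K held-out
    points, then retrain on all N points and report g with E_val(g_minus)."""
    N = len(y)
    idx = np.random.permutation(N)
    val, tr = idx[:K], idx[K:]
    g_minus = train(X[tr], y[tr])            # hypothesis from D_train
    E_val = error(g_minus, X[val], y[val])   # estimate on D_val
    g = train(X, y)                          # put K back: train on all of D
    return g, E_val                          # report g, not g_minus!

# Rule of thumb: K = N // 5
```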
Why Validation?
- $D_{val}$ is used to make learning choices
- Recall figure 3 above: we can stop training at the lowest $E_{val}$ (early stopping)
- We can also choose the level of regularization λ that minimizes $E_{val}$
What's the difference between a validation set and a test set?
- Test set is unbiased, validation set has optimistic bias
- Take 2 hypotheses h1 and h2 with $E_{out}(h_{1}) = E_{out}(h_{2}) = 0.5$
- Their error estimates e1 and e2 are uniform on [0, 1]
- Pick h ∈ {h1, h2} with e = min(e1, e2)
- Then E[e] < 0.5 (since P(at least one of e1, e2 < 0.5) = 75%) ⇒ an optimistic bias, but it does not hurt much.
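A quick simulation of this optimistic bias (the setup mirrors the toy example above):

```python
import numpy as np

rng = np.random.default_rng(0)
e1, e2 = rng.uniform(size=(2, 100000))  # two unbiased estimates, uniform on [0, 1]
e = np.minimum(e1, e2)                  # always pick the better-looking hypothesis
print(e.mean())                         # ~0.333 < 0.5: optimistic bias
print((e < 0.5).mean())                 # ~0.75: P(at least one estimate < 0.5)
```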
Model Selection
Model selection uses $D_{val}$ more than once (refer to the left figure below).
- M models $H_{1}, …, H_{M}$
- Use $D_{train}$ to learn $g_{m}^{-}$
- Evaluate each $g_{m}^{-}$ using $D_{val}$
- Pick the model $H_{m^{\ast}}$ whose $g_{m^{\ast}}^{-}$ has the smallest validation error $E_{m^{\ast}}$ (this introduces bias)
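A minimal sketch of this selection loop; `models`, `train`, and `error` are hypothetical placeholders:

```python
def select_model(models, X_tr, y_tr, X_val, y_val, error):
    """Train each candidate on D_train; pick the one with smallest E_val."""
    best_m, best_E = None, float("inf")
    for m, train in enumerate(models):       # models: list of training functions
        g_minus = train(X_tr, y_tr)          # g_m^- learned on D_train
        E_m = error(g_minus, X_val, y_val)   # E_val(g_m^-)
        if E_m < best_E:
            best_m, best_E = m, E_m
    return best_m, best_E                    # best_E carries an optimistic bias
```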
The bias
- $E_{val}(g_{m^{\ast}}^{-})$ is a biased estimate of $E_{out}(g_{m^{\ast}}^{-})$
- Refer to the right figure above: there are 2 models to choose from, and at each K the one with the smaller validation error is picked.
- Why does the curve go up? The x-axis is the validation size K ⇒ the training size N-K decreases along the x-axis
- the $E_{out}$ curve is a learning curve read in reverse
- $E_{val}$ increases as fewer points are used for training
- Why do the curves get closer together?
- With larger K, the estimates become more and more reliable ⇒ closer to the true error
How much bias?
- $D_{val}$ is in effect used for "training" on the set of finalist models:
- $H_{val} = \{g_{1}^{-},\dots,g_{M}^{-}\}$
- By Hoeffding and VC: $E_{out}(g_{m^{\ast}}^{-}) \leq E_{val}(g_{m^{\ast}}^{-}) + O(\sqrt{\frac{\ln M}{K}})$
Data Contamination
- Error estimates: $E_{in},E_{out},E_{val}$
- Contamination: optimistic (deceptive) bias in estimating $E_{out}$
- Training set: totally contaminated ($E_{in}$ is no longer an estimate of $E_{out}$)
- Validation set: slightly contaminated
- Test set: totally 'clean'
Cross Validation
The dilemma about K
- $E_{out}(g) \underset{\text{small } K}{\approx} E_{out}(g^{-}) \underset{\text{large } K}{\approx} E_{val}(g^{-})$
- can we have K both small and large?
Leave One Out
- N-1 points for training, 1 point for validation
- Let all N points take turns being that 1 validation point, and average the errors: $E_{cv} = \frac{1}{N}\sum\limits_{n=1}^{N} e_{n}$, where $e_{n} = e(g_{n}^{-}(x_{n}), y_{n})$ and $g_{n}^{-}$ is trained on all points except $(x_{n},y_{n})$
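A minimal sketch of leave-one-out cross validation, with `train` and `error` again as hypothetical placeholders:

```python
import numpy as np

def loocv_error(X, y, train, error):
    """E_cv = (1/N) * sum_n e_n, where e_n is the error of g_n^- (trained
    on all points except n) evaluated on the left-out point (x_n, y_n)."""
    N = len(y)
    e = np.empty(N)
    for n in range(N):
        mask = np.arange(N) != n             # leave point n out
        g_minus = train(X[mask], y[mask])    # train on the other N-1 points
        e[n] = error(g_minus, X[n:n+1], y[n:n+1])
    return e.mean()
```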
Cross validation in action
- We try to choose the number of features using $E_{cv}$
- From the figure below, we may want #parameters = 6, as it has the lowest $E_{cv}$
- The result is shown on the right: we are better off with validation
K-fold cross validation
LOOCV can be very expensive for large datasets: you need to train N different hypotheses $g_{n}^{-}$, each on N-1 points.
Instead, we use K-fold cross validation, i.e., K validation folds with $\frac{N}{K}$ points each: train on K-1 folds, validate on the held-out fold, and average the K estimates. A sketch follows below.
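A minimal sketch under the same placeholder conventions:

```python
import numpy as np

def kfold_error(X, y, train, error, K=10):
    """Split D into K folds; train on K-1 folds, validate on the held-out
    fold, and average the K validation errors."""
    N = len(y)
    idx = np.random.permutation(N)
    folds = np.array_split(idx, K)           # ~N/K points per fold
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        g_minus = train(X[tr], y[tr])
        errs.append(error(g_minus, X[val], y[val]))
    return float(np.mean(errs))
```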
References
- Caltech Lecture 12, Regularization
- Caltech Lecture 13, Validation