Bias and Variance

Song Jiaming | 17 Nov 2017


This article is a summary note for CS3244 Week 7 content. The lecture notes of this course used Caltech's notes as a reference.

Approximation-Generalization Tradeoff

Suppose we have a set of hypothesis functions H to approximate f. The ideal case is when H={ f }. However, that’s unlikely to happen.

More realistically, we hope H contains a hypothesis with a small $E_{out}$, which means it approximates f well on unseen (out-of-sample) data. However, there is a tradeoff:

  • Complex H (more hypotheses) → better chance of approximating f
  • Simple H (fewer hypotheses) → better chance of generalizing to unseen samples




Bias-Variance Analysis

We need a way to quantify the above tradeoff. Bias-variance analysis is one way (VC analysis is another); it decomposes $E_{out}$ into 2 components:

  1. How well H can approximate f overall
  2. How well we can zoom in on a good h ∈ H


Approach: we train a hypothesis from H on data, compare its predictions with the real-valued targets (the true values of the data), and use the squared error to quantify the difference.

  • Start with $E_{out}$:
    • Let g ∈ H be a hypothesis, D be the dataset we trained g on, and x be an out-of-sample point on which we evaluate g.
    • Then $E_{out}(g^{(D)}) = E_{x}[ (g^{(D)}(x) - f(x))^{2}]$


  • We want to generalize over all possible datasets D of size N (take the expectation over D):
    • $E_{D}[E_{out}(g^{(D)})] = E_{D}[E_{x}[ (g^{(D)}(x) - f(x))^{2}]] = E_{x}[\bbox[pink]{E_{D}[(g^{(D)}(x) - f(x))^{2}]}]$


  • To evaluate $E_{D}[(g^{(D)}(x) - f(x))^{2}]$, define average hypothesis $\bar{g}(x) = E_{D}[g^{(D)}(x)]$.
    • Imagine we have $D_{1},D_{2},\ldots,D_{K}$; then $\bar{g}(x) \approx \frac{1}{K}\sum\limits_{k=1}^{K} g^{(D_{k})}(x)$


  • Then, \begin{align} E_{D}[(g^{(D)}(x) - f(x))^{2}] &= E_{D}[(g^{(D)}(x)- \bar{g}(x) + \bar{g}(x) - f(x))^{2}] \\ &= E_{D}[(\bbox[lightblue]{g^{(D)}(x)- \bar{g}(x)})^{2} + (\bbox[lightGreen]{\bar{g}(x) - f(x)})^{2} + 2(\bbox[lightblue]{g^{(D)}(x)- \bar{g}(x)}\,)(\,\bbox[lightGreen]{\bar{g}(x) - f(x)})] \end{align}


  • The cross term vanishes: since $\bar{g}(x) - f(x)$ does not depend on D, $E_{D}[2(\bbox[lightblue]{g^{(D)}(x)- \bar{g}(x)}\,)(\,\bbox[lightgreen]{\bar{g}(x) - f(x)})] = 2(\bar{g}(x) - f(x))\,E_{D}[g^{(D)}(x)- \bar{g}(x)] = 2(\bar{g}(x) - f(x))(\bar{g}(x)-\bar{g}(x)) = 0$


  • $\bbox[pink]{E_{D}[(g^{(D)}(x) - f(x))^{2}]} = \bbox[lightblue]{E_{D}[(g^{(D)}(x) - \bar{g}(x))^{2}]} + \bbox[lightgreen]{(\bar{g}(x) - f(x))^{2}}$
    • The part highlighted in pink: how far your hypothesis (trained on a specific dataset D) is from the true f, averaged over D
    • The part highlighted in blue: how far your hypothesis (trained on a specific dataset D) is from the average hypothesis $\bar{g}$, the best you can hope for over all Ds
      • This is var(x)
      • Because it measures how much the hypothesis learned from each D differs from the average hypothesis
    • The part highlighted in green: how far the average hypothesis is from the true f
      • This is bias(x)
      • Because that is how far even the best hypothesis you can hope for is biased away from f


  • Finally, $E_{D}[E_{out}(g^{(D)})] = E_{x}[bias(x)+var(x)]$ = bias + var
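
To see the decomposition in action, here is a minimal Monte Carlo sketch (a toy setup of my own, not from the lecture): fit a straight line to a noiseless target $f(x) = \sin(\pi x)$ on many datasets of 5 points each, estimate $\bar{g}$, bias, and var, and check that bias + var matches the expected out-of-sample error.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)          # assumed target function
N, K = 5, 5000                           # dataset size, number of datasets D
x_test = np.linspace(-1, 1, 1000)        # grid approximating E_x[.]

# Fit a degree-1 polynomial g^(D) on each dataset D and evaluate it on x_test
preds = np.empty((K, x_test.size))
for k in range(K):
    x_d = rng.uniform(-1, 1, N)
    w = np.polyfit(x_d, f(x_d), deg=1)   # least-squares line
    preds[k] = np.polyval(w, x_test)

g_bar = preds.mean(axis=0)                             # average hypothesis g-bar(x)
bias = np.mean((g_bar - f(x_test)) ** 2)               # E_x[(g-bar(x) - f(x))^2]
var = np.mean((preds - g_bar) ** 2)                    # E_x[E_D[(g^(D)(x) - g-bar(x))^2]]
expected_e_out = np.mean((preds - f(x_test)) ** 2)     # E_D[E_out(g^(D))]

print(f"bias + var = {bias + var:.4f}, expected E_out = {expected_e_out:.4f}")
```

The two printed numbers agree up to floating-point error, because the cross term averages out exactly when $\bar{g}$ is the empirical mean of the fitted hypotheses.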

Tradeoff

We want to argue that as H becomes more complex, bias goes down while variance goes up, and vice versa.

  • Bias = $E_{x}[(\bar{g}(x) - f(x))^{2}]$
  • Variance = $E_{x}[E_{D}[(g^{(D)}(x) - \bar{g}(x))^{2}]]$


As seen in the figure below:

  • When H contains only a single hypothesis, that hypothesis is generally far from f. When H contains more hypotheses, the average hypothesis $\bar{g}$ gets closer to f.
  • However, with a larger H, the hypothesis $g^{(D)}$ learned from a particular dataset can land anywhere around $\bar{g}$, whereas with a single hypothesis there is nothing to vary.

Thus, as H grows, bias decreases but variance increases.

[Figure: Bias-Variance]


Example: Sine targets

[Figure: sine target approximated by $H_{0}$ (constants) and $H_{1}$ (lines)]

Even though the best single hypothesis in $H_{0}$ has a high $E_{out}$, $H_{0}$ has a much lower variance, so its expected $E_{out}$ over datasets ends up smaller. The lesson: always match the 'model complexity' to the data resources, not to the target complexity.
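
A rough Monte Carlo reproduction of that comparison is sketched below; the setup (target $f(x) = \sin(\pi x)$ on $[-1, 1]$, N = 2 points per dataset, $H_{0}$ = constant hypotheses, $H_{1}$ = lines) follows the Caltech example, and the trial count is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(np.pi * x)
K, x_test = 20000, np.linspace(-1, 1, 1000)

def bias_var(fit):
    """Estimate bias and var for a learner mapping 2 training points to predictions on x_test."""
    preds = np.empty((K, x_test.size))
    for k in range(K):
        x_d = rng.uniform(-1, 1, 2)      # N = 2 noiseless training points
        preds[k] = fit(x_d, f(x_d))
    g_bar = preds.mean(axis=0)
    return np.mean((g_bar - f(x_test)) ** 2), np.mean((preds - g_bar) ** 2)

# H0: the best constant (mean of the two y-values); H1: the line through both points
h0 = lambda x_d, y_d: np.full_like(x_test, y_d.mean())
h1 = lambda x_d, y_d: np.polyval(np.polyfit(x_d, y_d, 1), x_test)

for name, fit in [("H0 (constant)", h0), ("H1 (line)", h1)]:
    b, v = bias_var(fit)
    print(f"{name}: bias = {b:.2f}, var = {v:.2f}, bias + var = {b + v:.2f}")
```

$H_{1}$ comes out with the lower bias but a far larger variance, so its expected $E_{out}$ (bias + var) is worse than $H_{0}$'s.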




Learning Curve

Given a dataset D of size N, we have:

  • Expected out-of-sample error: $E_{D}[E_{out}(g^{(D)})]$
  • Expected in-sample error: $E_{D}[E_{in}(g^{(D)})]$


The learning curve plots how these expected errors vary with the dataset size N.

  • For a simple model, the expected errors are higher, but the gap between the two errors is smaller

[Figure: learning curves for a simple and a complex model]


Linear Regression Case

  • Noisy target: $y = (w^{\ast})^{T}x + \text{noise}$
  • Dataset: $D = \{(x_{1},y_{1}),\ldots,(x_{N},y_{N})\}$
  • Linear regression solution: $w = (X^{T}X)^{-1}X^{T}y$
  • In-sample error vector = Xw - y
  • ‘Out-of-sample’:
    • Generate another target y’ = $(w^{\ast})^{T}x$ + noise’
    • Error vector = Xw - y’

[Figure: learning curves for linear regression]
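
A small simulation of this recipe is sketched below; the feature dimension d = 3, noise level σ = 0.5, the range of N, and the trial count are all assumptions for illustration. For each N it draws many datasets from the noisy linear target, solves for w by least squares, and measures the in-sample error against y and the 'out-of-sample' error against a fresh y' with regenerated noise.

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma, trials = 3, 0.5, 2000          # feature dimension, noise level, datasets per N
w_star = rng.normal(size=d)              # true weights of the noisy linear target

for N in range(d + 1, 31, 5):
    e_in, e_out = 0.0, 0.0
    for _ in range(trials):
        X = rng.normal(size=(N, d))
        y = X @ w_star + sigma * rng.normal(size=N)     # y  = X w* + noise
        y2 = X @ w_star + sigma * rng.normal(size=N)    # y' = X w* + noise' ("out of sample")
        w = np.linalg.lstsq(X, y, rcond=None)[0]        # w = (X^T X)^{-1} X^T y
        e_in += np.mean((X @ w - y) ** 2)
        e_out += np.mean((X @ w - y2) ** 2)
    print(f"N={N:2d}  E_in={e_in / trials:.3f}  E_out={e_out / trials:.3f}")
```

As N grows, the expected $E_{in}$ rises toward $\sigma^{2}$ from below and the expected $E_{out}$ falls toward $\sigma^{2}$ from above, reproducing the shape of the learning curve.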




Overfitting

Illustration

Suppose we have 5 data points:

  • A simple target function does not fit them perfectly; there must be noise
  • A 4th-order polynomial fits them perfectly: $E_{in}=0$ but $E_{out}$ is huge ⇒ overfitting
  • Reason: if a 3rd-order polynomial is used instead, $E_{in}$ won't be 0, but $E_{out}$ will be reduced. So the overfitting happens at the 4th order, not the 3rd.

Overfitting: fitting the data more than is warranted ($E_{in}$ goes down while $E_{out}$ goes up). It is fitting the noise: the model treats the noise as a pattern in the sample, which of course does not carry over out-of-sample.
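
A quick sketch of the illustration above (the quadratic target, noise level, and seed are assumptions of mine; the lecture does not pin down the exact numbers): fit polynomials of increasing order to 5 noisy points and compare $E_{in}$ with $E_{out}$.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: x ** 2                       # assumed simple target
x_d = np.sort(rng.uniform(-1, 1, 5))
y_d = f(x_d) + 0.2 * rng.normal(size=5)    # 5 noisy data points
x_test = np.linspace(-1, 1, 1000)

for deg in (2, 3, 4):
    w = np.polyfit(x_d, y_d, deg)
    e_in = np.mean((np.polyval(w, x_d) - y_d) ** 2)
    e_out = np.mean((np.polyval(w, x_test) - f(x_test)) ** 2)
    print(f"degree {deg}: E_in = {e_in:.4f}  E_out = {e_out:.4f}")
# The degree-4 fit interpolates the 5 points (E_in = 0) but typically has the largest E_out.
```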

Case Study

  • We first create the datasets using the two target polynomials shown in the left figure (a noisy 10th-order target and a noiseless 50th-order target).
  • Then we choose two other functions, a 2nd-order and a 10th-order polynomial, to fit those datasets.

[Figure: 2nd-order vs 10th-order fits to both targets]

The irony of the two learners, O (overfitting) and R (restricted):
we only tell the two learners that the target function f is a 10th-order polynomial, and they have to find it from the 15 points given.

  • O: why not just choose 10th order, $H_{10}$
  • R: Only 15 points? I will just choose $H_{2}$ and pray for luck
  • So when is $H_{2}$ better, when is $H_{10}$ better?
  • Even when there is no noise, the irony (overfitting) still occurs

[Figure: overfitting on the noiseless 50th-order target]


The role of Noise

A detailed experiment
Impact of noise level and target complexity: $y = f(x) + \epsilon(x) = \bbox[pink]{\sum\limits_{q=0}^{Q_{f}} \alpha_{q} x^{q} }+ \epsilon(x)$

  • $\epsilon(x)$: the noise term, with variance $\sigma^{2}$
  • $Q_{f}$: Order of the polynomial f, implies the target complexity
  • The pink part is the polynomial equation up to degree $Q_{f}$.
    • and it is normalised (the polynomials of different degrees are orthonormal to each other)
  • Dataset size: N


Measure Overfitting:

  • Still using the 2nd-order $g_{2}\in H_{2}$ and the 10th-order $g_{10}\in H_{10}$
  • Overfit measure: $E_{out}(g_{10})- E_{out}(g_{2})$
    • Positive: the more complex model has worse out-of-sample error ⇒ overfitting
    • Negative: the more complex model did better ⇒ no overfitting
    • Zero: they perform the same


Running 10 million iterations of the whole process, including:

  • Generate a target
  • Generate a dataset
  • Fit both models
  • Compute the overfit measure

Then we obtain:

[Figure: overfit measure as a function of noise level and target complexity]
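
A heavily scaled-down sketch of that experiment (a few thousand trials instead of 10 million, targets drawn in the standard polynomial basis rather than the lecture's normalised one, and $Q_{f}$ = 10, N = 15, σ = 0.5 chosen as example values):

```python
import numpy as np

rng = np.random.default_rng(4)
Qf, N, sigma, trials = 10, 15, 0.5, 2000   # target order, dataset size, noise level, iterations
x_test = np.linspace(-1, 1, 500)

overfit = 0.0
for _ in range(trials):
    coeffs = rng.normal(size=Qf + 1)                 # random 10th-order target (standard basis,
    f_test = np.polyval(coeffs, x_test)              # not the lecture's orthonormal polynomials)
    x_d = rng.uniform(-1, 1, N)
    y_d = np.polyval(coeffs, x_d) + sigma * rng.normal(size=N)

    e_out = {}
    for deg in (2, 10):                              # fit g_2 and g_10
        w = np.polyfit(x_d, y_d, deg)
        e_out[deg] = np.mean((np.polyval(w, x_test) - f_test) ** 2)
    overfit += e_out[10] - e_out[2]

print(f"average overfit measure E_out(g10) - E_out(g2): {overfit / trials:.3f}")
# Positive on average here: the 10th-order fit chases the noise in the 15 points.
```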


| Compare | Stochastic Noise | Deterministic Noise |
| --- | --- | --- |
| Description | Fluctuations/measurement errors we cannot capture | The part of f that H lacks the capacity to capture |
| Formula | $y_{n} = f(x_{n})$ + stochastic noise | $y_{n} = h^{\ast}(x_{n})$ + deterministic noise |
| Change H | Remains the same for a different hypothesis set | Depends on your hypothesis set |
| Re-fit the same x | Changes | Stays the same |


Noise and Bias-Variance

Recall that $E_{D}[E_{out}(g^{(D)})]$ = bias + var
A noisy target (instead of just f) is $y = f(x) + \epsilon(x)$, where $E[\epsilon(x)] = 0$
A noise term: \begin{align} E_{D,\epsilon}[ (g^{(D)}(x)-y)^{2} ] &= E_{D,\epsilon}[ (g^{(D)}(x)-f(x) - \epsilon(x))^{2} ] \\ &= E_{D,\epsilon}[ (g^{(D)}(x)- \bar{g}(x) + \bar{g}(x) -f(x) - \epsilon(x))^{2} ] \\&= E_{D,\epsilon}[ \bbox[lightblue]{(g^{(D)}(x)- \bar{g}(x))^{2}} + \bbox[lightgreen]{(\bar{g}(x) -f(x))^{2}} + \bbox[pink]{(\epsilon(x))^{2}} + \text{cross terms} ]\end{align}

  • $\bbox[lightblue]{blue}$ part is variance
  • $\bbox[lightgreen]{green}$ part is bias, the deterministic noise
  • $\bbox[pink]{pink}$ part is $\sigma^{2}$, the stochastic noise.
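
A numeric check of this, under the same toy assumptions as the earlier sketches (line fits to $f(x) = \sin(\pi x)$, now with Gaussian noise of standard deviation σ = 0.3 and N = 5):

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(np.pi * x)
sigma, N, K = 0.3, 5, 5000
x_test = np.linspace(-1, 1, 1000)

preds = np.empty((K, x_test.size))
err = 0.0
for k in range(K):
    x_d = rng.uniform(-1, 1, N)
    y_d = f(x_d) + sigma * rng.normal(size=N)             # noisy target y = f(x) + eps
    w = np.polyfit(x_d, y_d, 1)
    preds[k] = np.polyval(w, x_test)
    y_test = f(x_test) + sigma * rng.normal(size=x_test.size)
    err += np.mean((preds[k] - y_test) ** 2)              # E_{D,eps}[(g^(D)(x) - y)^2]

g_bar = preds.mean(axis=0)
bias = np.mean((g_bar - f(x_test)) ** 2)
var = np.mean((preds - g_bar) ** 2)
print(f"measured error = {err / K:.3f}  vs  bias + var + sigma^2 = {bias + var + sigma ** 2:.3f}")
```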


Two Cures for Overfitting

  1. Regularization: restrain the model
  2. Validation: reality check by peeking at the bottom line



References

  1. Caltech Lecture 8, Bias-Variance Tradeoff
  2. Caltech Lecture 11, Overfitting