Regularization

An interactive companion to the Konstanz 2026 lecture. Move the sliders.

1. The data-generating process

We fix a true regression function f and observe noisy samples

yᵢ = f(xᵢ) + εᵢ,  εᵢ ~ 𝒩(0, σ²) i.i.d.,

with inputs xᵢ drawn uniformly on [0, 1]. Throughout this page we take f(x) = sin(2πx), the canonical Bishop chapter-1 example, because it is non-polynomial, has interesting curvature, and is impossible to fit exactly with the polynomial function class we’ll use.

Our task is to learn an estimator f̂ from a training sample D = {(x₁, y₁), …, (xₙ, yₙ)} that approximates f well on new x.
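A minimal NumPy sketch of this data-generating process, matching the setup above. The values n = 15 and σ = 0.25 are illustrative (σ² = 0.0625 matches the noise floor shown in Demo 3), and the seed is arbitrary:

```python
import numpy as np

def make_data(n, sigma, rng=None):
    """Draw n noisy samples y = sin(2*pi*x) + eps with x ~ Uniform[0, 1]."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 1.0, size=n)                            # inputs, drawn uniformly
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)   # noisy targets
    return x, y

x_train, y_train = make_data(n=15, sigma=0.25, rng=0)
```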

Demo 1 · The data-generating process

[interactive plot · axes: x ∈ [0, 1], y]

true f(x) = sin(2π x) noisy training points

The two knobs that control the difficulty of the problem are n and σ. With σ small and n large, any reasonable estimator will recover f essentially perfectly. With σ large and n small, even the true f leaves residuals: the irreducible error σ² that no estimator can dodge.

2. Polynomial regression by ordinary least squares

Choose a degree d and use the polynomial function class

f̂(x; w) = w₀ + w₁x + w₂x² + ⋯ + w_d x^d.

Stack the training inputs into an n × (d+1) Vandermonde design matrix X with Xᵢⱼ = xᵢʲ. The OLS estimator minimizes

‖y − Xw‖² = Σᵢ (yᵢ − f̂(xᵢ; w))²

with the well-known closed form

ŵ = (XᵀX)⁻¹Xᵀy.

(In the demos we work in a rescaled coordinate so that the columns of X stay bounded, but this is a numerical convenience; the math is unchanged.)
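A minimal sketch of the OLS fit in NumPy, assuming the make_data helper from §1. It uses np.linalg.lstsq rather than forming the inverse explicitly, which is numerically safer but computes the same minimizer:

```python
import numpy as np

def polynomial_design(x, d):
    """Vandermonde design matrix with columns 1, x, x^2, ..., x^d."""
    return np.vander(x, N=d + 1, increasing=True)

def fit_ols(x, y, d):
    """Least-squares polynomial coefficients of degree d."""
    X = polynomial_design(x, d)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

def predict(w_hat, x):
    return polynomial_design(x, len(w_hat) - 1) @ w_hat

w_hat = fit_ols(x_train, y_train, d=3)
x_grid = np.linspace(0, 1, 200)
y_grid = predict(w_hat, x_grid)   # fitted curve on a dense grid (cf. Demo 2)
```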

Demo 2 · Polynomial regression by OLS

[interactive plot · axes: x, y]
train MSE 0.0904 · test MSE 0.1016 · d = 3

Watch what happens as you crank up d at fixed n: the training MSE keeps falling, but past some degree the test MSE turns around and climbs as the fit starts chasing the noise.

This is the symptom of the bias–variance tradeoff. To diagnose it properly, we need the decomposition.

3. The bias–variance decomposition

What we ultimately care about is the total error, the expected squared error of our learned f̂ averaged over a fresh test input x, the training set D, and the test noise ε:

R(f̂) = 𝔼_{x, D, ε}[(y − f̂(x; D))²],  where y = f(x) + ε.

The trick is to fix x first, decompose the inner expectation over D and ε, and then take the outer expectation over x.

Step 1: pointwise decomposition. At a fixed x,

𝔼_{D, ε}[(y − f̂(x; D))²] = Bias²(x) + Var(x) + σ²,

with Bias(x) = 𝔼_D[f̂(x; D)] − f(x) and Var(x) = 𝔼_D[(f̂(x; D) − 𝔼_D[f̂(x; D)])²].

Sketch. Add and subtract 𝔼_D[f̂(x; D)] inside the square, then f(x), expand, and use 𝔼_D[f̂ − 𝔼_D f̂] = 0 together with 𝔼[ε] = 0 (and the independence of ε from D) to kill the cross terms. Three nonnegative pieces remain.
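Spelled out, the expansion behind that sketch (a display-math sketch; f̂ abbreviates f̂(x; D) and f̄ := 𝔼_D[f̂]):

```latex
% y = f(x) + \varepsilon,  \hat f := \hat f(x; D),  \bar f := \mathbb{E}_D[\hat f]
\begin{aligned}
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f)^2\big]
  &= \mathbb{E}_{D,\varepsilon}\Big[\big(\varepsilon + (f(x) - \bar f) + (\bar f - \hat f)\big)^2\Big] \\
  &= \underbrace{(f(x) - \bar f)^2}_{\mathrm{Bias}^2(x)}
   + \underbrace{\mathbb{E}_D\big[(\hat f - \bar f)^2\big]}_{\mathrm{Var}(x)}
   + \underbrace{\mathbb{E}_\varepsilon\big[\varepsilon^2\big]}_{\sigma^2}.
\end{aligned}
```

The three cross terms vanish because 𝔼[ε] = 0, ε is independent of D, and 𝔼_D[f̂ − f̄] = 0.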

Step 2: integrate over x. The outer 𝔼ₓ passes through the equality term-by-term. The noise term is constant in x (homoscedastic), so it survives unchanged:

R(f̂) = 𝔼ₓ[Bias²(x)] + 𝔼ₓ[Var(x)] + σ².

Reading. Bias is how far the average learner is from the truth, a property of the function class. Variance is how much the learner wobbles across draws of D, a property of how much it is allowed to chase noise. Noise is the floor we cannot beat.

The Monte Carlo. In this demo the expectations over D and x are estimated by simulation. We draw many independent training sets of size n, fit a degree-d polynomial on each, and evaluate each fit on a uniform grid of x values. The pointwise quantities Bias²(x) and Var(x) are sample estimates over the resamples; the scalars shown below are uniform-grid averages over x. Both approximations are simultaneously refined by the same draws.
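A sketch of that Monte Carlo in NumPy, reusing make_data, fit_ols, and predict from above. The number of resamples (200), the grid size, and the seed are illustrative choices, not the demo's actual settings:

```python
import numpy as np

def bias_variance_mc(d, n, sigma, n_sets=200, n_grid=101, seed=0):
    """Monte Carlo estimate of E_x[Bias^2], E_x[Var], and the noise floor."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0, 1, n_grid)
    preds = np.empty((n_sets, n_grid))
    for m in range(n_sets):
        x, y = make_data(n, sigma, rng)              # fresh training set D
        preds[m] = predict(fit_ols(x, y, d), grid)   # f_hat(grid; D)
    f_true = np.sin(2 * np.pi * grid)
    mean_pred = preds.mean(axis=0)                   # estimate of E_D[f_hat(x)]
    bias2 = (mean_pred - f_true) ** 2                # Bias^2(x) on the grid
    var = preds.var(axis=0)                          # Var(x) on the grid
    return bias2.mean(), var.mean(), sigma ** 2      # uniform-grid averages + sigma^2

b2, v, s2 = bias_variance_mc(d=3, n=15, sigma=0.25)
print(f"E_x[Bias^2]={b2:.4f}  E_x[Var]={v:.4f}  sigma^2={s2:.4f}")
```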

Demo 3 · Live Monte-Carlo bias–variance decomposition

[interactive plot · ŷ over many training sets · axes: x, ŷ]

true f individual fits mean fit E[ŷ]

Pointwise decomposition · 𝔼[(y − ŷ)² | x] = Bias²(x) + Var(x) + σ²

[stacked area chart · axes: x, expected error at x]
𝔼ₓ[Bias²] = 0.0052 · 𝔼ₓ[Var] = 0.0166 · σ² = 0.0625

𝔼ₓ[Bias²] + 𝔼ₓ[Var] + σ² = 0.0842  ·  empirical test MSE = 0.0832 (should match within MC error)

The top plot shows the ensemble of fits; the second plot is the integrand of R as a function of x, with the three components stacked. The top edge of the stack is the total expected error at each x, and the area of each band integrates to its scalar in the legend below.

Things to try:

  1. Set d = 0 (the constant fit). The orange Bias² band dominates and tracks f(x)², visibly larger near the peaks where |f(x)| is maximal. Variance is uniformly tiny; every training set produces nearly the same flat line.
  2. Set d high at small n. Variance explodes, especially near x = 0 and x = 1, where high-degree polynomials wobble the most given thin boundary support. Bias² stays small everywhere.
  3. Slide σ up. The gray noise band rises uniformly across x; the integrated identity should still track the empirical test MSE within Monte Carlo error.

This is the picture we are trying to manipulate. Regularization is the lever.

4. Ridge regression: shrinking variance with

The simplest fix for high-variance regimes is to add an ℓ₂ penalty on the weights:

ŵ_ridge = argmin_w ‖y − Xw‖² + λ‖w‖².

We do not penalize the intercept; let Ĩ be the identity matrix with its intercept entry zeroed. Setting the gradient to zero gives the closed form

ŵ_ridge = (XᵀX + λĨ)⁻¹Xᵀy.

Two limits to keep in mind: λ → 0 recovers OLS, and λ → ∞ shrinks every non-intercept coefficient to zero (the prediction collapses to the constant fit ȳ).
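A minimal NumPy sketch of the closed form, assuming the polynomial_design helper from §2; the intercept is left unpenalized by zeroing the first diagonal entry:

```python
import numpy as np

def fit_ridge(x, y, d, lam):
    """Ridge coefficients (X^T X + lam * I_tilde)^{-1} X^T y, intercept unpenalized."""
    X = polynomial_design(x, d)
    I_tilde = np.eye(d + 1)
    I_tilde[0, 0] = 0.0                              # do not shrink the intercept
    return np.linalg.solve(X.T @ X + lam * I_tilde, X.T @ y)

w_ridge = fit_ridge(x_train, y_train, d=12, lam=1e-3)
```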

Demo 4 · Ridge regression (L₂ penalty)

[interactive plot · axes: x, y]

Coefficients ŵⱼ: w0 = 0.02, w1 = -3.16, w2 = -1.44, w3 = 3.62, w4 = 3.24, w5 = 2.18, w6 = -0.42, w7 = 0.34, w8 = -1.06, w9 = -1.29, w10 = -0.43, w11 = -2.43, w12 = 0.40

Pin the polynomial degree at the deliberately-too-large d = 12 and sweep λ: the coefficients shrink smoothly, the wild oscillations get ironed out, and the test MSE first falls and then rises again once the fit becomes too rigid.

Geometrically, ridge replaces “minimize ‖y − Xw‖² + λ‖w‖²” with “minimize ‖y − Xw‖² subject to ‖w‖² ≤ t” for some t that depends on λ. The constraint is a Euclidean ball, so the solution is shrunk uniformly but rarely exactly zero.
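This equivalence is standard convex duality rather than something from the lecture; a minimal statement in LaTeX, with t tied to λ through the penalized solution itself:

```latex
\hat w_{\mathrm{ridge}}(\lambda)
  \;=\; \arg\min_{w} \;\|y - Xw\|_2^2 + \lambda\|w\|_2^2
  \;=\; \arg\min_{\|w\|_2^2 \le t} \;\|y - Xw\|_2^2
  \qquad\text{for } t = \|\hat w_{\mathrm{ridge}}(\lambda)\|_2^2 .
```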

5. Lasso: sparsity from ℓ₁ (bonus)

This section is extra material — ℓ₁ regularization was not covered in the lecture. Skip ahead to §6 if you only want what was on the slides.

Switching the penalty to ℓ₁,

ŵ_lasso = argmin_w ‖y − Xw‖² + λ‖w‖₁,

changes the geometry. The constraint set ‖w‖₁ ≤ t is a polytope with corners on the coordinate axes, so the optimum tends to land on a corner, i.e., with several coefficients exactly zero.

There is no closed form, but the objective splits as g(w) + h(w) with g smooth (the squared loss) and h non-smooth but separable (the ℓ₁ penalty). This is the perfect setting for proximal gradient descent, also known here as ISTA:

w ← S_τ(w − η ∇g(w)),  τ = ηλ,  ∇g(w) = 2Xᵀ(Xw − y),

where S_τ is the elementwise soft-threshold

S_τ(z) = sign(z) · max(|z| − τ, 0)

(with the intercept entry left alone). The step size must satisfy η ≤ 1/(2L), where L is an upper bound on the largest eigenvalue of XᵀX; we use the Frobenius bound L = ‖XᵀX‖_F, which is loose but free.
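A compact ISTA sketch in NumPy matching the update above, assuming polynomial_design from §2; the iteration count is an arbitrary illustrative choice:

```python
import numpy as np

def soft_threshold(z, tau):
    """Elementwise soft-threshold S_tau(z) = sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fit_lasso_ista(x, y, d, lam, n_iters=5000):
    """Proximal gradient descent (ISTA) for ||y - Xw||^2 + lam * ||w||_1."""
    X = polynomial_design(x, d)
    L = np.linalg.norm(X.T @ X, "fro")    # loose bound on lambda_max(X^T X)
    eta = 1.0 / (2.0 * L)                 # safe step size
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)    # gradient of the squared loss
        z = w - eta * grad                # plain gradient step
        w = soft_threshold(z, eta * lam)  # proximal (shrinkage) step
        w[0] = z[0]                       # intercept is not penalized: no shrinkage
    return w

w_lasso = fit_lasso_ista(x_train, y_train, d=12, lam=0.01)
```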

Demo 5 · Lasso (L₁ penalty) · sparsity

[interactive plot · axes: x, y]

Coefficients · 12 / 12 non-zero: w0 = 0.17, w1 = -2.27, w2 = -0.47, w3 = 1.35, w4 = 0.03, w5 = 0.93, w6 = 0.13, w7 = 0.28, w8 = 0.19, w9 = -0.11, w10 = 0.19, w11 = -0.32, w12 = 0.16

Push λ up and count how many coefficients have collapsed exactly to zero. Lasso isn’t doing rounding-to-zero; those weights are genuinely the optimum of the penalized objective. This is what makes it a feature-selection procedure.

6. Putting it together

Now run the bias–variance Monte Carlo with regularization turned on. At fixed degree d, sweeping λ is moving along a path through model space that trades bias for variance smoothly.

Demo 3 · Live Monte-Carlo bias–variance decomposition

penalty:
[interactive plot · ŷ over many training sets · axes: x, ŷ]

true f individual fits mean fit E[ŷ]

Pointwise decomposition · 𝔼[(y − ŷ)² | x] = Bias²(x) + Var(x) + σ²

[stacked area chart · axes: x, expected error at x]
𝔼ₓ[Bias²] = 0.0098 · 𝔼ₓ[Var] = 0.1918 · σ² = 0.0625

𝔼ₓ[Bias²] + 𝔼ₓ[Var] + σ² = 0.2641  ·  empirical test MSE = 0.2608 (should match within MC error)

Recommended exercise: with d set deliberately high and ridge selected, find the λ that minimizes total error. Then switch to lasso and find its optimum. Compare. Finally, switch the penalty off and observe how badly OLS does at the same d. The gap is what regularization buys you.

7. The regularization path

So far we’ve moved sliders and watched single-shot views: a fit at one λ, an integrated decomposition at one λ. The natural next move is to sweep λ at fixed d and n and watch how each piece of the total error

R(f̂_λ) = 𝔼ₓ[Bias²(x)] + 𝔼ₓ[Var(x)] + σ²

traces out a curve. This is the canonical bias–variance tradeoff plot:

Demo 7 · The regularization path

penalty:
[interactive line chart · axes: λ (log scale), expected error · marker at λ* = 6e-2]
𝔼ₓ[Bias²] 𝔼ₓ[Var] σ² total R(f̂; λ)

min total = 0.1145  at  λ* = 6e-2 (this is what cross-validation is approximately searching for)

Read it left-to-right: at tiny λ the model is barely constrained, so variance dominates and the total error is high; as λ grows, variance falls faster than bias rises and the total drops; past the sweet spot, bias takes over and the total climbs again. The total-error curve is U-shaped, with its minimum at λ*.

That minimum is the population-optimum hyperparameter, the value we would choose if we knew f and σ. We don’t, of course. Cross-validation, AIC, BIC, marginal likelihood under a Gaussian prior, evidence approximation: every model-selection procedure you’ve seen is some way of estimating λ* from data alone, with different sample-size and bias trade-offs of its own. But all of them are picking a point on this same curve.
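A sketch of the data-only version of that search: K-fold cross-validation over a λ grid for the ridge fit, assuming fit_ridge and predict from the earlier sketches. K = 5, the grid, and the fold scheme are illustrative choices, not what the demo uses:

```python
import numpy as np

def cv_lambda(x, y, d, lambdas, k=5, seed=0):
    """Pick lambda by K-fold cross-validated MSE for the ridge polynomial fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    cv_mse = []
    for lam in lambdas:
        errs = []
        for f in range(k):
            val = folds[f]                                           # held-out fold
            trn = np.concatenate([folds[j] for j in range(k) if j != f])
            w = fit_ridge(x[trn], y[trn], d, lam)
            errs.append(np.mean((predict(w, x[val]) - y[val]) ** 2))
        cv_mse.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_mse))], cv_mse

lambdas = np.logspace(-5, 1, 25)
lam_star, curve = cv_lambda(x_train, y_train, d=12, lambdas=lambdas)
```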

Things to try:

  1. Push d up to 15. The variance curve at small λ explodes (high-degree polynomials are very high-capacity); λ* shifts right.
  2. Drop n to 15. The variance curve rises across the board; λ* shifts right again. Less data → more regularization needed.
  3. Switch to lasso. The bias curve rises in characteristic stair-steps as λ crosses thresholds where individual coefficients hit zero.