Regularization

An interactive companion to the Konstanz 2026 lecture. Move the sliders.

1. The data-generating process

We fix a true regression function f and observe noisy samples

yᵢ = f(xᵢ) + εᵢ,  εᵢ ~ 𝒩(0, σ²) i.i.d.,

with inputs xᵢ drawn uniformly on [0, 1]. Throughout this page we take f(x) = sin(2πx), the canonical Bishop chapter-1 example, because it is non-polynomial, has interesting curvature, and is impossible to fit exactly with the polynomial function class we’ll use.

Our task is to learn an estimator f̂ from a training sample D = {(x₁, y₁), …, (xₙ, yₙ)} that approximates f well on new x.
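A minimal NumPy sketch of this data-generating process, matching the setup above. The values n = 15 and σ = 0.25 are illustrative (σ² = 0.0625 matches the noise floor shown in Demo 3), and the seed is arbitrary:

```python
import numpy as np

def make_data(n, sigma, rng=None):
    """Draw n noisy samples y = sin(2*pi*x) + eps with x ~ Uniform[0, 1]."""
    rng = np.random.default_rng(rng)
    x = rng.uniform(0.0, 1.0, size=n)                            # inputs, drawn uniformly
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, size=n)   # noisy targets
    return x, y

x_train, y_train = make_data(n=15, sigma=0.25, rng=0)
```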

Demo 1 · The data-generating process

[interactive plot · axes: x ∈ [0, 1], y]

true f(x) = sin(2π x) noisy training points

The two knobs that control the difficulty of the problem are n and σ. With σ small and n large, any reasonable estimator will recover f essentially perfectly. With σ large and n small, even the true f leaves residuals: the irreducible error σ² that no estimator can dodge.

2. Polynomial regression by ordinary least squares

Choose a degree d and use the polynomial function class

f̂(x; w) = w₀ + w₁x + w₂x² + ⋯ + w_d x^d.

Stack the training inputs into an n × (d+1) Vandermonde design matrix X with Xᵢⱼ = xᵢʲ. The OLS estimator minimizes

‖y − Xw‖² = Σᵢ (yᵢ − f̂(xᵢ; w))²

with the well-known closed form

ŵ = (XᵀX)⁻¹Xᵀy.

(In the demos we work in a rescaled coordinate so that the columns of X stay bounded, but this is a numerical convenience; the math is unchanged.)
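A minimal sketch of the OLS fit in NumPy, assuming the make_data helper from §1. It uses np.linalg.lstsq rather than forming the inverse explicitly, which is numerically safer but computes the same minimizer:

```python
import numpy as np

def polynomial_design(x, d):
    """Vandermonde design matrix with columns 1, x, x^2, ..., x^d."""
    return np.vander(x, N=d + 1, increasing=True)

def fit_ols(x, y, d):
    """Least-squares polynomial coefficients of degree d."""
    X = polynomial_design(x, d)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w_hat

def predict(w_hat, x):
    return polynomial_design(x, len(w_hat) - 1) @ w_hat

w_hat = fit_ols(x_train, y_train, d=3)
x_grid = np.linspace(0, 1, 200)
y_grid = predict(w_hat, x_grid)   # fitted curve on a dense grid (cf. Demo 2)
```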

Demo 2 · Polynomial regression by OLS

[interactive plot · axes: x, y]
train MSE 0.0904 · test MSE 0.1016 · d = 3

Watch what happens as you crank up d at fixed n: the training MSE keeps falling, but past some degree the test MSE turns around and climbs as the fit starts chasing the noise.

This is the symptom of the bias–variance tradeoff. To diagnose it properly, we need the decomposition.

3. The bias–variance decomposition

What we ultimately care about is the total error, the expected squared error of our learned f̂ averaged over a fresh test input x, the training set D, and the test noise ε:

R(f̂) = 𝔼_{x, D, ε}[(y − f̂(x; D))²],  where y = f(x) + ε.

The trick is to fix x first, decompose the inner expectation over D and ε, and then take the outer expectation over x.

Step 1: pointwise decomposition. At a fixed x,

𝔼_{D, ε}[(y − f̂(x; D))²] = Bias²(x) + Var(x) + σ²,

with Bias(x) = 𝔼_D[f̂(x; D)] − f(x) and Var(x) = 𝔼_D[(f̂(x; D) − 𝔼_D[f̂(x; D)])²].

Sketch. Add and subtract 𝔼_D[f̂(x; D)] inside the square, then f(x), expand, and use 𝔼_D[f̂ − 𝔼_D f̂] = 0 together with 𝔼[ε] = 0 (and the independence of ε from D) to kill the cross terms. Three nonnegative pieces remain.
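Spelled out, the expansion behind that sketch (a display-math sketch; f̂ abbreviates f̂(x; D) and f̄ := 𝔼_D[f̂]):

```latex
% y = f(x) + \varepsilon,  \hat f := \hat f(x; D),  \bar f := \mathbb{E}_D[\hat f]
\begin{aligned}
\mathbb{E}_{D,\varepsilon}\big[(y - \hat f)^2\big]
  &= \mathbb{E}_{D,\varepsilon}\Big[\big(\varepsilon + (f(x) - \bar f) + (\bar f - \hat f)\big)^2\Big] \\
  &= \underbrace{(f(x) - \bar f)^2}_{\mathrm{Bias}^2(x)}
   + \underbrace{\mathbb{E}_D\big[(\hat f - \bar f)^2\big]}_{\mathrm{Var}(x)}
   + \underbrace{\mathbb{E}_\varepsilon\big[\varepsilon^2\big]}_{\sigma^2}.
\end{aligned}
```

The three cross terms vanish because 𝔼[ε] = 0, ε is independent of D, and 𝔼_D[f̂ − f̄] = 0.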

Step 2: integrate over x. The outer 𝔼ₓ passes through the equality term-by-term. The noise term is constant in x (homoscedastic), so it survives unchanged:

R(f̂) = 𝔼ₓ[Bias²(x)] + 𝔼ₓ[Var(x)] + σ².

Reading. Bias is how far the average learner is from the truth, a property of the function class. Variance is how much the learner wobbles across draws of D, a property of how much it is allowed to chase noise. Noise is the floor we cannot beat.

The Monte Carlo. In this demo the expectations over D and x are estimated by simulation. We draw many independent training sets of size n, fit a degree-d polynomial on each, and evaluate each fit on a uniform grid of x values. The pointwise quantities Bias²(x) and Var(x) are sample estimates over the resamples; the scalars shown below are uniform-grid averages over x. Both approximations are simultaneously refined by the same draws.
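A sketch of that Monte Carlo in NumPy, reusing make_data, fit_ols, and predict from above. The number of resamples (200), the grid size, and the seed are illustrative choices, not the demo's actual settings:

```python
import numpy as np

def bias_variance_mc(d, n, sigma, n_sets=200, n_grid=101, seed=0):
    """Monte Carlo estimate of E_x[Bias^2], E_x[Var], and the noise floor."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0, 1, n_grid)
    preds = np.empty((n_sets, n_grid))
    for m in range(n_sets):
        x, y = make_data(n, sigma, rng)              # fresh training set D
        preds[m] = predict(fit_ols(x, y, d), grid)   # f_hat(grid; D)
    f_true = np.sin(2 * np.pi * grid)
    mean_pred = preds.mean(axis=0)                   # estimate of E_D[f_hat(x)]
    bias2 = (mean_pred - f_true) ** 2                # Bias^2(x) on the grid
    var = preds.var(axis=0)                          # Var(x) on the grid
    return bias2.mean(), var.mean(), sigma ** 2      # uniform-grid averages + sigma^2

b2, v, s2 = bias_variance_mc(d=3, n=15, sigma=0.25)
print(f"E_x[Bias^2]={b2:.4f}  E_x[Var]={v:.4f}  sigma^2={s2:.4f}")
```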

Demo 3 · Live Monte-Carlo bias–variance decomposition

[interactive plot · ŷ over many training sets · axes: x, ŷ]

true f individual fits mean fit E[ŷ]

Pointwise decomposition · 𝔼[(y − ŷ)² | x] = Bias²(x) + Var(x) + σ²

[stacked area chart · axes: x, expected error at x]
𝔼ₓ[Bias²] = 0.0052 · 𝔼ₓ[Var] = 0.0166 · σ² = 0.0625

𝔼ₓ[Bias²] + 𝔼ₓ[Var] + σ² = 0.0842  ·  empirical test MSE = 0.0832 (should match within MC error)

The top plot shows the ensemble of fits; the second plot is the integrand of R as a function of x, with the three components stacked. The top edge of the stack is the total expected error at each x, and the area of each band integrates to its scalar in the legend below.

Things to try:

  1. Set d = 0 (the constant fit). The orange Bias² band dominates and tracks f(x)², visibly larger near the peaks where |f(x)| is maximal. Variance is uniformly tiny; every training set produces nearly the same flat line.
  2. Set d high at small n. Variance explodes, especially near x = 0 and x = 1, where high-degree polynomials wobble the most given thin boundary support. Bias² stays small everywhere.
  3. Slide σ up. The gray noise band rises uniformly across x; the integrated identity should still track the empirical test MSE within Monte Carlo error.

This is the picture we are trying to manipulate. Regularization is the lever.

4. Ridge regression: shrinking variance with

The simplest fix for high-variance regimes is to add an ℓ₂ penalty on the weights:

ŵ_ridge = argmin_w ‖y − Xw‖² + λ‖w‖².

We do not penalize the intercept; let Ĩ be the identity matrix with its intercept entry zeroed. Setting the gradient to zero gives the closed form

ŵ_ridge = (XᵀX + λĨ)⁻¹Xᵀy.

Two limits to keep in mind: λ → 0 recovers OLS, and λ → ∞ shrinks every non-intercept coefficient to zero (the prediction collapses to the constant fit ȳ).
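A minimal NumPy sketch of the closed form, assuming the polynomial_design helper from §2; the intercept is left unpenalized by zeroing the first diagonal entry:

```python
import numpy as np

def fit_ridge(x, y, d, lam):
    """Ridge coefficients (X^T X + lam * I_tilde)^{-1} X^T y, intercept unpenalized."""
    X = polynomial_design(x, d)
    I_tilde = np.eye(d + 1)
    I_tilde[0, 0] = 0.0                              # do not shrink the intercept
    return np.linalg.solve(X.T @ X + lam * I_tilde, X.T @ y)

w_ridge = fit_ridge(x_train, y_train, d=12, lam=1e-3)
```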

Demo 4 · Ridge regression (L₂ penalty)

[interactive plot · axes: x, y]

Coefficients ŵⱼ: w0 = 0.02, w1 = -3.16, w2 = -1.44, w3 = 3.62, w4 = 3.24, w5 = 2.18, w6 = -0.42, w7 = 0.34, w8 = -1.06, w9 = -1.29, w10 = -0.43, w11 = -2.43, w12 = 0.40

Pin the polynomial degree at the deliberately-too-large d = 12 and sweep λ: the coefficients shrink smoothly, the wild oscillations get ironed out, and the test MSE first falls and then rises again once the fit becomes too rigid.

Geometrically, ridge replaces “minimize ‖y − Xw‖² + λ‖w‖²” with “minimize ‖y − Xw‖² subject to ‖w‖² ≤ t” for some t that depends on λ. The constraint is a Euclidean ball, so the solution is shrunk uniformly but rarely exactly zero.
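This equivalence is standard convex duality rather than something from the lecture; a minimal statement in LaTeX, with t tied to λ through the penalized solution itself:

```latex
\hat w_{\mathrm{ridge}}(\lambda)
  \;=\; \arg\min_{w} \;\|y - Xw\|_2^2 + \lambda\|w\|_2^2
  \;=\; \arg\min_{\|w\|_2^2 \le t} \;\|y - Xw\|_2^2
  \qquad\text{for } t = \|\hat w_{\mathrm{ridge}}(\lambda)\|_2^2 .
```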

5. Lasso: sparsity from ℓ₁ (bonus)

This section is extra material — ℓ₁ regularization was not covered in the lecture. Skip ahead to §6 if you only want what was on the slides.

Switching the penalty to ℓ₁,

ŵ_lasso = argmin_w ‖y − Xw‖² + λ‖w‖₁,

changes the geometry. The constraint set ‖w‖₁ ≤ t is a polytope with corners on the coordinate axes, so the optimum tends to land on a corner, i.e., with several coefficients exactly zero.

There is no closed form, but the objective splits as g(w) + h(w) with g smooth (the squared loss) and h non-smooth but separable (the ℓ₁ penalty). This is the perfect setting for proximal gradient descent, also known here as ISTA:

w ← S_τ(w − η ∇g(w)),  τ = ηλ,  ∇g(w) = 2Xᵀ(Xw − y),

where S_τ is the elementwise soft-threshold

S_τ(z) = sign(z) · max(|z| − τ, 0)

(with the intercept entry left alone). The step size must satisfy η ≤ 1/(2L), where L is an upper bound on the largest eigenvalue of XᵀX; we use the Frobenius bound L = ‖XᵀX‖_F, which is loose but free.
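A compact ISTA sketch in NumPy matching the update above, assuming polynomial_design from §2; the iteration count is an arbitrary illustrative choice:

```python
import numpy as np

def soft_threshold(z, tau):
    """Elementwise soft-threshold S_tau(z) = sign(z) * max(|z| - tau, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fit_lasso_ista(x, y, d, lam, n_iters=5000):
    """Proximal gradient descent (ISTA) for ||y - Xw||^2 + lam * ||w||_1."""
    X = polynomial_design(x, d)
    L = np.linalg.norm(X.T @ X, "fro")    # loose bound on lambda_max(X^T X)
    eta = 1.0 / (2.0 * L)                 # safe step size
    w = np.zeros(d + 1)
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)    # gradient of the squared loss
        z = w - eta * grad                # plain gradient step
        w = soft_threshold(z, eta * lam)  # proximal (shrinkage) step
        w[0] = z[0]                       # intercept is not penalized: no shrinkage
    return w

w_lasso = fit_lasso_ista(x_train, y_train, d=12, lam=0.01)
```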

Demo 5 · Lasso (L₁ penalty) · sparsity

[interactive plot · axes: x, y]

Coefficients · 12 / 12 non-zero: w0 = 0.17, w1 = -2.27, w2 = -0.47, w3 = 1.35, w4 = 0.03, w5 = 0.93, w6 = 0.13, w7 = 0.28, w8 = 0.19, w9 = -0.11, w10 = 0.19, w11 = -0.32, w12 = 0.16

Push λ up and count how many coefficients have collapsed exactly to zero. Lasso isn’t doing rounding-to-zero; those weights are genuinely the optimum of the penalized objective. This is what makes it a feature-selection procedure.

6. Putting it together

Now run the bias–variance Monte Carlo with regularization turned on. At fixed degree d, sweeping λ is moving along a path through model space that trades bias for variance smoothly.

Demo 3 · Live Monte-Carlo bias–variance decomposition

penalty:
[interactive plot · ŷ over many training sets · axes: x, ŷ]

true f individual fits mean fit E[ŷ]

Pointwise decomposition · 𝔼[(y − ŷ)² | x] = Bias²(x) + Var(x) + σ²

[stacked area chart · axes: x, expected error at x]
𝔼ₓ[Bias²] = 0.0098 · 𝔼ₓ[Var] = 0.1918 · σ² = 0.0625

𝔼ₓ[Bias²] + 𝔼ₓ[Var] + σ² = 0.2641  ·  empirical test MSE = 0.2608 (should match within MC error)

Recommended exercise: with d set deliberately high and ridge selected, find the λ that minimizes total error. Then switch to lasso and find its optimum. Compare. Finally, switch the penalty off and observe how badly OLS does at the same d. The gap is what regularization buys you.

7. The regularization path

So far we’ve moved sliders and watched single-shot views: a fit at one λ, an integrated decomposition at one λ. The natural next move is to sweep λ at fixed d and n and watch how each piece of the total error

R(f̂_λ) = 𝔼ₓ[Bias²(x)] + 𝔼ₓ[Var(x)] + σ²

traces out a curve. This is the canonical bias–variance tradeoff plot:

Demo 7 · The regularization path

penalty:
[interactive line chart · axes: λ (log scale), expected error · marker at λ* = 6e-2]
𝔼ₓ[Bias²] 𝔼ₓ[Var] σ² total R(f̂; λ)

min total = 0.1145  at  λ* = 6e-2 (this is what cross-validation is approximately searching for)

Read it left-to-right: at tiny λ the model is barely constrained, so variance dominates and the total error is high; as λ grows, variance falls faster than bias rises and the total drops; past the sweet spot, bias takes over and the total climbs again. The total-error curve is U-shaped, with its minimum at λ*.

That minimum is the population-optimum hyperparameter, the value we would choose if we knew f and σ. We don’t, of course. Cross-validation, AIC, BIC, marginal likelihood under a Gaussian prior, evidence approximation: every model-selection procedure you’ve seen is some way of estimating λ* from data alone, with different sample-size and bias trade-offs of its own. But all of them are picking a point on this same curve.
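A sketch of the data-only version of that search: K-fold cross-validation over a λ grid for the ridge fit, assuming fit_ridge and predict from the earlier sketches. K = 5, the grid, and the fold scheme are illustrative choices, not what the demo uses:

```python
import numpy as np

def cv_lambda(x, y, d, lambdas, k=5, seed=0):
    """Pick lambda by K-fold cross-validated MSE for the ridge polynomial fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    cv_mse = []
    for lam in lambdas:
        errs = []
        for f in range(k):
            val = folds[f]                                           # held-out fold
            trn = np.concatenate([folds[j] for j in range(k) if j != f])
            w = fit_ridge(x[trn], y[trn], d, lam)
            errs.append(np.mean((predict(w, x[val]) - y[val]) ** 2))
        cv_mse.append(np.mean(errs))
    return lambdas[int(np.argmin(cv_mse))], cv_mse

lambdas = np.logspace(-5, 1, 25)
lam_star, curve = cv_lambda(x_train, y_train, d=12, lambdas=lambdas)
```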

Things to try:

  1. Push d up to 15. The variance curve at small λ explodes (high-degree polynomials are very high-capacity); λ* shifts right.
  2. Drop n to 15. The variance curve rises across the board; λ* shifts right again. Less data → more regularization needed.
  3. Switch to lasso. The bias curve rises in characteristic stair-steps as λ crosses thresholds where individual coefficients hit zero.