Regularization
An interactive companion to the Konstanz 2026 lecture. Move the sliders.
1. The data-generating process
We fix a true regression function f and observe noisy samples
y = f(x) + ε,  ε ~ 𝒩(0, σ²),
with inputs drawn uniformly. Throughout this page we take f(x) = sin(2πx), the canonical Bishop chapter-1 example, because it is non-polynomial, has interesting curvature, and is impossible to fit exactly with the polynomial function class we’ll use.
Our task is to learn an estimator ŷ from a training sample D = {(xᵢ, yᵢ)}, i = 1, …, N, that approximates f well on new x.
Demo 1 · The data-generating process
Legend: true f(x) = sin(2πx) · noisy training points
The two knobs that control the difficulty of the problem are the noise level σ and the sample size N. With σ small and N large, any reasonable estimator will recover f essentially perfectly. With σ large and N small, even the true f leaves residuals: σ² is the irreducible error that no estimator can dodge.
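The data-generating process fits in a few lines of NumPy. This is a minimal sketch: the function names, the seed, and the [0, 1] input interval are illustrative assumptions, not the demo's internals.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """True regression function: the Bishop chapter-1 example."""
    return np.sin(2 * np.pi * x)

def make_dataset(N, sigma, rng=rng):
    """Draw N inputs uniformly on [0, 1] and add Gaussian noise of std sigma."""
    x = rng.uniform(0.0, 1.0, size=N)
    y = f(x) + rng.normal(0.0, sigma, size=N)
    return x, y

x, y = make_dataset(N=10, sigma=0.3)
```

Cranking sigma up or N down reproduces the two difficulty regimes described above.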
2. Polynomial regression by ordinary least squares
Choose a degree M and use the polynomial function class
ŷ(x; w) = Σⱼ₌₀ᴹ wⱼ xʲ.
Stack the training inputs into the N × (M+1) Vandermonde design matrix Φ with Φᵢⱼ = xᵢʲ. The OLS estimator minimizes
‖y − Φw‖²,
with the well-known closed form
ŵ = (ΦᵀΦ)⁻¹Φᵀy.
(In the demos we work in a rescaled input coordinate so that the columns of Φ stay bounded, but this is a numerical convenience; the math is unchanged.)
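The design matrix and the OLS fit take only a few lines in NumPy. A sketch with illustrative helper names; note that `np.linalg.lstsq` is numerically preferable to forming (ΦᵀΦ)⁻¹ explicitly, especially at high degree.

```python
import numpy as np

def design_matrix(x, M):
    """Vandermonde design matrix with Phi[i, j] = x_i ** j for j = 0..M."""
    return np.vander(x, M + 1, increasing=True)

def fit_ols(x, y, M):
    """Minimize ||y - Phi w||^2; lstsq solves the least-squares problem stably."""
    w, *_ = np.linalg.lstsq(design_matrix(x, M), y, rcond=None)
    return w

def predict(w, x):
    return design_matrix(x, len(w) - 1) @ w
```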
Demo 2 · Polynomial regression by OLS
Watch what happens as you crank up M at fixed N:
- Training MSE decreases monotonically; more parameters can always reduce in-sample error.
- Test MSE traces the famous U-shape: high at small M (the model can’t represent f), high again at large M (the model fits the noise), with a sweet spot in between.
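The train/test curves can be reproduced offline. A sketch; N = 15, σ = 0.3, and the degree range are illustrative choices, not the demo's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)

N, sigma = 15, 0.3
x_tr = rng.uniform(0, 1, N);   y_tr = f(x_tr) + rng.normal(0, sigma, N)
x_te = rng.uniform(0, 1, 500); y_te = f(x_te) + rng.normal(0, sigma, 500)

def mse_for_degree(M):
    """Fit degree-M OLS on the training set; return (train MSE, test MSE)."""
    w, *_ = np.linalg.lstsq(np.vander(x_tr, M + 1, increasing=True), y_tr, rcond=None)
    pred = lambda x: np.vander(x, M + 1, increasing=True) @ w
    return np.mean((pred(x_tr) - y_tr) ** 2), np.mean((pred(x_te) - y_te) ** 2)

for M in range(10):
    tr, te = mse_for_degree(M)
    print(f"M={M}: train={tr:.4f}  test={te:.4f}")
```

The train column falls monotonically with M; the test column traces the U-shape.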
This is the symptom of the bias–variance tradeoff. To diagnose it properly, we need the decomposition.
3. The bias–variance decomposition
What we ultimately care about is the total error, the expected squared error of our learned ŷ averaged over a fresh test input x, the training set D, and the test noise ε:
𝔼ₓ,D,ε[(y − ŷ(x))²].
The trick is to fix x first, decompose the inner expectation over D and ε, and then take the outer one over x.
Step 1: pointwise decomposition. At a fixed x,
𝔼_D,ε[(y − ŷ(x))² | x] = (f(x) − 𝔼_D[ŷ(x)])² + 𝔼_D[(ŷ(x) − 𝔼_D[ŷ(x)])²] + σ² = Bias²(x) + Var(x) + σ².
Sketch. Add and subtract 𝔼_D[ŷ(x)] inside the square, then substitute y = f(x) + ε, expand, and use 𝔼[ε] = 0 together with the independence of ε and D to kill the cross terms. Three nonnegative pieces remain.
Step 2: integrate over x. The outer 𝔼ₓ passes through the equality term by term. The noise term σ² is constant in x (homoscedastic), so it survives unchanged:
𝔼ₓ,D,ε[(y − ŷ(x))²] = 𝔼ₓ[Bias²(x)] + 𝔼ₓ[Var(x)] + σ².
Reading. Bias is how far the average learner 𝔼_D[ŷ] is from the truth f, a property of the function class. Variance is how much the learner wobbles across draws of D, a property of how much it is allowed to chase noise. Noise σ² is the floor we cannot beat.
The Monte Carlo. This demo estimates every term empirically. We draw K independent training sets of size N, fit a degree-M polynomial on each, and evaluate the fits on a uniform grid of x values. The pointwise quantities Bias²(x) and Var(x) are sample estimates over the K resamples; the scalars shown below are uniform-grid averages over x. Both approximations are simultaneously refined by the same draws.
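The resampling loop fits in a few lines. The grid-averaged Bias² + Var, plus σ², matches the mean squared deviation of the fits from f exactly, by construction; only the comparison against fresh test noise is Monte Carlo. K, N, σ, and M below are illustrative, not the demo's settings.

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)
N, sigma, M, K = 25, 0.3, 3, 400                 # sample size, noise, degree, resamples

x_grid = np.linspace(0, 1, 101)
Phi_grid = np.vander(x_grid, M + 1, increasing=True)

preds = np.empty((K, x_grid.size))
for k in range(K):
    x = rng.uniform(0, 1, N)
    y = f(x) + rng.normal(0, sigma, N)
    w, *_ = np.linalg.lstsq(np.vander(x, M + 1, increasing=True), y, rcond=None)
    preds[k] = Phi_grid @ w

bias2 = (preds.mean(axis=0) - f(x_grid)) ** 2    # pointwise squared bias
var = preds.var(axis=0)                          # pointwise variance over resamples
total = bias2.mean() + var.mean() + sigma ** 2   # grid-averaged identity
```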
Demo 3 · Live Monte-Carlo bias–variance decomposition
Legend: true f · individual fits · mean fit 𝔼[ŷ]
Pointwise decomposition · 𝔼[(y − ŷ)² | x] = Bias²(x) + Var(x) + σ²
𝔼ₓ[Bias²] + 𝔼ₓ[Var] + σ² = 0.0842 · empirical test MSE = 0.0832 (should match within MC error)
The top plot shows the ensemble of fits; the second plot is the integrand of the decomposition as a function of x, with the three components stacked. The top edge of the stack is the total expected error at each x, and the area of each band integrates to its scalar in the legend below.
Things to try:
- Set M = 0 (the constant fit). The orange Bias² band dominates and tracks (f(x) − 𝔼_D[ŷ(x)])², visibly larger near the peaks where |f(x)| is maximal. Variance is uniformly tiny; every training set produces nearly the same flat line.
- Set M large at small N. Variance explodes, especially near the edges of the input range, where high-degree polynomials wobble the most given thin boundary support. Bias² stays small everywhere.
- Slide σ up. The gray noise band rises uniformly across x; the integrated identity should still track the empirical test MSE within Monte Carlo error.
This is the picture we are trying to manipulate. Regularization is the lever.
4. Ridge regression: shrinking variance with λ
The simplest fix for high-variance regimes is to add an ℓ₂ penalty on the weights:
ŵ = argmin_w ‖y − Φw‖² + λ‖w‖².
We do not penalize the intercept; let D = diag(0, 1, …, 1) and use the penalty λwᵀDw. Setting the gradient to zero gives the closed form
ŵ = (ΦᵀΦ + λD)⁻¹Φᵀy.
Two limits to keep in mind: λ → 0 recovers OLS, and λ → ∞ shrinks every non-intercept coefficient to zero (the prediction collapses to the constant fit ȳ).
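The closed form, with the unpenalized intercept handled via D, is one `np.linalg.solve` call. A sketch assuming the same Vandermonde features as before; the helper name is illustrative.

```python
import numpy as np

def fit_ridge(x, y, M, lam):
    """Ridge estimate w_hat = (Phi^T Phi + lam * D)^{-1} Phi^T y,
    where D = diag(0, 1, ..., 1) leaves the intercept unpenalized."""
    Phi = np.vander(x, M + 1, increasing=True)
    D = np.eye(M + 1)
    D[0, 0] = 0.0
    return np.linalg.solve(Phi.T @ Phi + lam * D, Phi.T @ y)
```

Both limits are easy to check: `lam=0.0` agrees with OLS, and a huge `lam` drives every non-intercept coefficient to (numerically) zero while the intercept tends to the sample mean ȳ.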
Demo 4 · Ridge regression (L₂ penalty)
Coefficients ŵj
Pin the polynomial degree at a deliberately-too-large M and sweep λ:
- At λ = 0 you get the unregularized fit: wiggly, hugging individual data points.
- At intermediate λ you typically get something close to the truth, even at this inflated degree.
- At large λ you’ve pulled all coefficients toward zero and the curve flattens.
Geometrically, ridge replaces “minimize ‖y − Φw‖²” with “minimize ‖y − Φw‖² subject to ‖w‖² ≤ t” for some t that depends on λ. The constraint set is a Euclidean ball, so the solution is shrunk uniformly but rarely exactly zero.
5. Lasso: sparsity from ℓ₁ (bonus)
This section is extra material — ℓ₁ regularization was not covered in the lecture. Skip ahead to §6 if you only want what was on the slides.
Switching the penalty to ℓ₁,
ŵ = argmin_w ½‖y − Φw‖² + λ‖w‖₁,
changes the geometry. The constraint set ‖w‖₁ ≤ t is a polytope with corners on the coordinate axes, so the optimum tends to land on a corner, i.e., with several coefficients exactly zero.
There is no closed form, but the objective splits as F(w) = g(w) + h(w) with g smooth (the squared loss) and h non-smooth but separable (the ℓ₁ penalty). This is the perfect setting for proximal gradient descent, also known here as ISTA:
w ← S_{ηλ}(w − η∇g(w)),
where S_t is the elementwise soft-threshold
S_t(w)ⱼ = sign(wⱼ) · max(|wⱼ| − t, 0)
(with the intercept entry left alone). The step size η must satisfy η ≤ 1/L, where L is an upper bound on the largest eigenvalue of ΦᵀΦ; we use the Frobenius bound L = ‖Φ‖²_F, which is loose but free.
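ISTA in full. A sketch for the objective ½‖y − Φw‖² + λ‖w‖₁ with the intercept unpenalized; `n_iter` and the helper names are illustrative choices.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise S_t(z) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def fit_lasso_ista(x, y, M, lam, n_iter=5000):
    """ISTA for 0.5 * ||y - Phi w||^2 + lam * ||w||_1, intercept unpenalized."""
    Phi = np.vander(x, M + 1, increasing=True)
    eta = 1.0 / np.sum(Phi ** 2)               # 1 / ||Phi||_F^2: the loose Frobenius bound
    w = np.zeros(M + 1)
    for _ in range(n_iter):
        z = w - eta * Phi.T @ (Phi @ w - y)    # gradient step on the smooth part
        w = soft_threshold(z, eta * lam)       # prox step on the l1 part
        w[0] = z[0]                            # intercept entry left alone
    return w
```

Because the prox is an exact soft-threshold, coefficients land at exactly 0.0, not merely near it; counting `np.count_nonzero(w)` across λ reproduces the sparsity readout in the demo.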
Demo 5 · Lasso (L₁ penalty) · sparsity
Coefficients · 12 / 12 non-zero
Push λ up and count how many coefficients have collapsed exactly to zero. Lasso isn’t doing rounding-to-zero; those weights are genuinely the optimum of the penalized objective. This is what makes it a feature-selection procedure.
6. Putting it together
Now run the bias–variance Monte Carlo with regularization turned on. At fixed degree M, sweeping λ moves along a path through model space that trades bias for variance smoothly.
Demo 3 · Live Monte-Carlo bias–variance decomposition
Legend: true f · individual fits · mean fit 𝔼[ŷ]
Pointwise decomposition · 𝔼[(y − ŷ)² | x] = Bias²(x) + Var(x) + σ²
𝔼ₓ[Bias²] + 𝔼ₓ[Var] + σ² = 0.2641 · empirical test MSE = 0.2608 (should match within MC error)
Recommended exercise: with a large degree M fixed and ridge selected, find the λ that minimizes total error. Then switch to lasso and find its optimum. Compare. Finally, switch the penalty off and observe how badly OLS does at the same M. The gap is what regularization buys you.
7. The regularization path
So far we’ve moved sliders and watched single-shot views: a fit at one λ, an integrated decomposition at one λ. The natural next move is to sweep λ at fixed M and watch how each piece of the total error
𝔼ₓ[Bias²] + 𝔼ₓ[Var] + σ²
traces out a curve. This is the canonical bias–variance tradeoff plot:
Demo 7 · The regularization path
min total = 0.1145 at λ* = 6e-2 (this is what cross-validation is approximately searching for)
Read it left-to-right:
- Small λ. The penalty is too weak to discipline a degree-M polynomial, so the variance curve dominates. Different training sets give wildly different fits.
- Large λ. Coefficients are crushed toward zero; the model can’t even represent f, so the bias curve dominates.
- Somewhere in between the two curves cross, and the total error has a strict minimum at λ* (red dashed). Below the total curve sits the irreducible noise floor σ² (gray dashed). No procedure can ever go below it.
That minimum λ* is the population-optimum hyperparameter, the value we would choose if we knew f and σ. We don’t, of course. Cross-validation, AIC, BIC, marginal likelihood under a Gaussian prior, the evidence approximation: every model-selection procedure you’ve seen is some way of estimating λ* from data alone, with different sample-size and bias trade-offs of its own. But all of them are picking a point on this same curve.
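The sweep behind this plot can be traced with the ridge closed form inside the Monte-Carlo loop. A sketch; N, σ, M, K, and the λ grid are illustrative choices, so the numbers will not match the demo's readout.

```python
import numpy as np

rng = np.random.default_rng(3)
f = lambda x: np.sin(2 * np.pi * x)
N, sigma, M, K = 25, 0.3, 9, 200        # sample size, noise, degree, resamples

x_grid = np.linspace(0, 1, 101)
Phi_g = np.vander(x_grid, M + 1, increasing=True)
D = np.eye(M + 1)
D[0, 0] = 0.0                           # intercept unpenalized

xs = rng.uniform(0, 1, (K, N))
ys = f(xs) + rng.normal(0, sigma, (K, N))

def decompose(lam):
    """Grid-averaged Bias^2, Var, and total error for ridge at this lambda."""
    preds = np.empty((K, x_grid.size))
    for k in range(K):
        Phi = np.vander(xs[k], M + 1, increasing=True)
        w = np.linalg.solve(Phi.T @ Phi + lam * D, Phi.T @ ys[k])
        preds[k] = Phi_g @ w
    bias2 = np.mean((preds.mean(axis=0) - f(x_grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    return bias2, var, bias2 + var + sigma ** 2

lams = np.logspace(-5, 2, 22)
totals = np.array([decompose(l)[2] for l in lams])
lam_star = lams[totals.argmin()]
```

Plotting the three returned components against λ on a log axis reproduces the crossing curves described above.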
Things to try:
- Push M up to 15. The variance curve at small λ explodes (high-degree polynomials are very high-capacity); λ* shifts right.
- Drop N to 15. The variance curve rises across the board; λ* shifts right again. Less data → more regularization needed.
- Switch to lasso. The bias curve rises in characteristic stair-steps as λ crosses thresholds where individual coefficients hit zero.