From Least Squares to Likelihood: The Dual Nature of Linear Regression

June 29, 2025

This post is accessible to readers with a basic understanding of matrix algebra and probability. If you're new to those topics, don’t worry — my upcoming posts will cover the essential foundations of linear algebra and probability for machine learning. Stay tuned!

Introduction

Imagine you want to predict the price $y$ of a house in City A. Naturally, the price depends on various factors — the size of the house, its location (proximity to schools, hospitals, etc.), its age, and perhaps a dozen more. In machine learning, we represent these influencing factors as a feature vector: $X = [x_1, x_2, \dots, x_n]$, where each $x_i$ is a numerical representation of some property of the house.

Our goal is to learn a relationship between $X$ and $y$, so we can predict the price of new houses just from their features. One of the most classic ways to approach this is through linear regression:

$$y \approx w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n = w^\top x$$

This is the least squares perspective: find the weights $w_1, \dots, w_n$ and bias $w_0$ that minimize the total squared error between predicted and actual prices across the training data.

In practice, no matter how many features we include, there will always be unpredictable or unmeasured influences — random fluctuations, measurement noise, and real-world complexities we choose to ignore for simplicity. That's why we don't write our model as $y = w^\top x$, but rather:

$$y = w^\top x + \varepsilon$$

The $\varepsilon$ term captures this uncertainty — the deviation between the actual observed value and the model's idealized prediction. It reminds us that the model is a simplification. As the statistician George E. P. Box famously said:

"All models are wrong, but some are useful."
— George E. P. Box

Thinking of $\varepsilon$ as a random variable — specifically, as a sample from a probability distribution — turns this from a deterministic equation into a probabilistic model. And as we'll see next, this small conceptual shift leads us directly to maximum likelihood estimation.

The Least Squares Perspective

Before diving into the math, let's address one notational point. In the previous section, we used $w_0$ to represent the intercept term. For clarity and to align with standard convention, we'll now write the intercept as $b$, and keep the rest of the weights as a vector $w = [w_1, w_2, \dots, w_n]^\top$. This allows us to express our model compactly as:

$$y = w^\top x + b$$

Now suppose we are given a dataset of $n$ examples, where each data point consists of a feature vector $x^{(i)} \in \mathbb{R}^d$ and a corresponding target value $y^{(i)} \in \mathbb{R}$. Our goal is to find the parameters $w$ and $b$ that best fit the data — in other words, that produce predictions $w^\top x^{(i)} + b$ that are as close as possible to the actual observed values $y^{(i)}$.

To make this concrete, here's a snapshot of a real dataset used for regression: the California Housing dataset. Each row represents one example (a house), and each column is a numerical feature such as the median income of the area, average number of rooms, etc. The target value $y$ is the median house value.

[Figure: California Housing dataset preview]

In our notation, each row corresponds to a feature vector $x^{(i)}$, and the final column (the target) corresponds to $y^{(i)}$. If the dataset has $n$ rows and $d$ features, then:

  • $X \in \mathbb{R}^{n \times d}$ is the matrix of feature vectors
  • $y \in \mathbb{R}^n$ is the vector of target values
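
If you want to see these shapes concretely, here is a minimal sketch (assuming scikit-learn and NumPy are installed) that loads the same California Housing data and prints the dimensions of $X$ and $y$:

```python
# Minimal shape check using scikit-learn's copy of the California Housing dataset.
from sklearn.datasets import fetch_california_housing

X, y = fetch_california_housing(return_X_y=True)

print(X.shape)  # (20640, 8) -> n = 20640 examples, d = 8 features
print(y.shape)  # (20640,)   -> one median house value per example
```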

To measure how good our predictions are, we define the squared error loss for each data point:

$$\text{Error}^{(i)} = \left(y^{(i)} - w^\top x^{(i)} - b\right)^2$$

In machine learning, such loss functions are typically called objective functions — mathematical expressions we aim to minimize (or maximize) using optimization techniques like gradient descent or closed-form solvers. In this case, we use the squared error because it's:

  • Differentiable, which makes it easy to optimize using calculus-based methods,
  • Convex, so any local minimum is also a global minimum

Here's a visual intuition for what the squared error captures: Each blue point represents a house in our dataset, plotted against a single feature (average number of rooms). The red line is the regression model’s prediction. The dashed vertical lines show the errors — the distance between the true value and the predicted value for each data point.

[Figure: Scatter plot with the regression line and residuals]

The squared error loss function penalizes these vertical distances — especially large ones — and finds the line that minimizes the total penalty over the dataset.

The total error over the dataset is then the sum of these individual squared errors. This gives us the least squares objective function:

$$\mathcal{L}(w, b) = \sum_{i=1}^{n} \left(y^{(i)} - w^\top x^{(i)} - b\right)^2$$

The optimal weights $w^*$ and intercept $b^*$ are those that minimize this loss:

$$\min_{w, b} \ \mathcal{L}(w, b)$$

This optimization problem defines the classical formulation of linear regression. It’s purely geometric and algebraic: we are finding the linear function that best fits the data by minimizing squared deviation.
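
To make the objective concrete, here is a short NumPy sketch that evaluates $\mathcal{L}(w, b)$ for a hand-picked $w$ and $b$ on a tiny synthetic dataset (all names here are illustrative, not part of any library):

```python
import numpy as np

def squared_error_loss(w, b, X, y):
    """L(w, b) = sum_i (y_i - w^T x_i - b)^2."""
    residuals = y - (X @ w + b)      # shape (n,)
    return np.sum(residuals ** 2)

# Tiny synthetic dataset: n = 4 examples, d = 2 features
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 12.0])

print(squared_error_loss(np.array([2.0, 1.0]), 1.0, X, y))  # 3.25
```

Gradient descent and the closed-form solution derived next both minimize exactly this quantity.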

To derive a closed-form solution, we express the entire dataset in matrix form. Let $X \in \mathbb{R}^{n \times d}$ be the matrix of feature vectors, where each row is a data point $x^{(i)\top}$, and let $y \in \mathbb{R}^{n}$ be the vector of target values.

To incorporate the intercept $b$, we augment $X$ with a column of ones, forming $\tilde{X} \in \mathbb{R}^{n \times (d+1)}$:

$$\tilde{X} = \begin{bmatrix} x^{(1)\top} & 1 \\ x^{(2)\top} & 1 \\ \vdots & \vdots \\ x^{(n)\top} & 1 \end{bmatrix}$$

Similarly, we define the augmented parameter vector:

$$\tilde{w} = \begin{bmatrix} w \\ b \end{bmatrix} \in \mathbb{R}^{d+1}$$

Now the prediction for all data points becomes simply:

$$\hat{y} = \tilde{X} \tilde{w}$$

And the least squares loss can be written compactly as:

$$\mathcal{L}(\tilde{w}) = \|y - \tilde{X} \tilde{w}\|_2^2$$
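
As a quick numerical sanity check (a sketch on the same toy data as before, with names chosen for illustration), the compact matrix form of the loss agrees with the per-example sum:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 7.0, 12.0])

w, b = np.array([2.0, 1.0]), 1.0
X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])  # append a column of ones
w_tilde = np.append(w, b)                           # stack [w; b]

compact = np.sum((y - X_tilde @ w_tilde) ** 2)                      # ||y - X~ w~||^2
per_example = sum((yi - w @ xi - b) ** 2 for xi, yi in zip(X, y))   # sum_i (...)^2

print(np.isclose(compact, per_example))  # True
```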

This is a standard quadratic minimization problem. Let's expand the squared loss $\mathcal{L}(\tilde{w}) = \|y - \tilde{X} \tilde{w}\|_2^2$ into its quadratic form:

$$\mathcal{L}(\tilde{w}) = (y - \tilde{X} \tilde{w})^\top (y - \tilde{X} \tilde{w})$$

Expanding this using standard matrix identities:

$$= y^\top y - 2 \tilde{w}^\top \tilde{X}^\top y + \tilde{w}^\top \tilde{X}^\top \tilde{X} \tilde{w}$$

Before we compute the gradient, let's confirm that $\tilde{w}^\top \tilde{X}^\top \tilde{X} \tilde{w}$ is indeed a scalar (a single number), which justifies the symmetry trick used later in the derivation.

Recall the matrix dimensions:

  • $\tilde{X} \in \mathbb{R}^{n \times (d+1)}$ — data matrix with the bias column added
  • $\tilde{w} \in \mathbb{R}^{(d+1) \times 1}$ — parameter vector

Then we have:

$$\tilde{w}^\top \tilde{X}^\top \tilde{X} \tilde{w} \in \mathbb{R}^{1 \times 1}$$

Here's the reasoning:

  1. $\tilde{X} \tilde{w} \in \mathbb{R}^{n \times 1}$ — the predicted values
  2. $\tilde{X}^\top (\tilde{X} \tilde{w}) \in \mathbb{R}^{(d+1) \times 1}$
  3. $\tilde{w}^\top (\tilde{X}^\top \tilde{X} \tilde{w}) \in \mathbb{R}^{1 \times 1}$

Therefore, the whole expression collapses to a single real number. Since it's a scalar, we can freely take its transpose:

$$\tilde{w}^\top \tilde{X}^\top \tilde{X} \tilde{w} = (\tilde{w}^\top \tilde{X}^\top \tilde{X} \tilde{w})^\top = \tilde{w}^\top (\tilde{X}^\top \tilde{X})^\top \tilde{w}$$

This justifies the simplification used in the gradient derivation.
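
A quick numerical check with random matrices (purely illustrative, not part of the derivation) confirms both the shape claim and the transpose identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X_tilde = rng.normal(size=(n, d + 1))
w_tilde = rng.normal(size=(d + 1, 1))

quad = w_tilde.T @ X_tilde.T @ X_tilde @ w_tilde
print(quad.shape)                 # (1, 1): effectively a scalar
print(np.allclose(quad, quad.T))  # True: transposing a scalar changes nothing
```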

We now take the gradient of $\mathcal{L}(\tilde{w})$ with respect to $\tilde{w}$. Using standard vector calculus identities (listed below), we obtain:

$$\nabla_{\tilde{w}} \mathcal{L} = -2 \tilde{X}^\top y + 2 \tilde{X}^\top \tilde{X} \tilde{w}$$

Setting the gradient to zero gives the first-order condition:

$$-2 \tilde{X}^\top y + 2 \tilde{X}^\top \tilde{X} \tilde{w} = 0$$

Cancelling the constant and rearranging, we get the normal equation:

$$\tilde{X}^\top \tilde{X} \tilde{w} = \tilde{X}^\top y$$

And assuming $\tilde{X}^\top \tilde{X}$ is invertible, the unique solution is:

$$\tilde{w}^* = (\tilde{X}^\top \tilde{X})^{-1} \tilde{X}^\top y$$

This is the closed-form solution to the least squares problem, derived directly from calculus and matrix algebra.
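
In code, solving the normal equation takes only a few lines of NumPy. The sketch below is one possible implementation: it solves $\tilde{X}^\top \tilde{X} \tilde{w} = \tilde{X}^\top y$ with np.linalg.solve instead of forming the inverse explicitly, which is numerically preferable (np.linalg.lstsq would also work and handles the non-invertible case):

```python
import numpy as np

def fit_least_squares(X, y):
    """Solve the normal equation X~^T X~ w~ = X~^T y and return (w, b)."""
    X_tilde = np.hstack([X, np.ones((X.shape[0], 1))])
    w_tilde = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)  # assumes invertibility
    return w_tilde[:-1], w_tilde[-1]

# Sanity check: recover known parameters from noiseless synthetic data
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
true_w, true_b = np.array([1.5, -2.0, 0.5]), 3.0
y = X @ true_w + true_b

w_hat, b_hat = fit_least_squares(X, y)
print(np.allclose(w_hat, true_w), np.isclose(b_hat, true_b))  # True True
```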

📘 Matrix Derivative Reference:

The following identities were used in the derivation:

  • $\frac{\partial}{\partial x} (x^\top A) = A$ (if $A$ is constant)
  • $\frac{\partial}{\partial x} (Ax) = A$
  • $\frac{\partial}{\partial x} (x^\top A x) = A^\top x + A x$ (or $2 A x$ if $A$ is symmetric)

These results are foundational in matrix calculus and commonly appear in optimization for machine learning.
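
These identities are also easy to sanity-check numerically. The sketch below compares the third identity against a central finite-difference approximation of the gradient of $x^\top A x$ (the code is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
A = rng.normal(size=(d, d))   # a general (not necessarily symmetric) matrix
x = rng.normal(size=d)

analytic = A.T @ x + A @ x    # claimed gradient of x^T A x

eps = 1e-6
numeric = np.array([
    ((x + eps * e) @ A @ (x + eps * e) - (x - eps * e) @ A @ (x - eps * e)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```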

This closed-form solution gives us the optimal weights and intercept that minimize the total squared error across the training data.

But there's also a powerful geometric interpretation here. If we look closely, we can see that linear regression is performing an orthogonal projection of the target vector $y \in \mathbb{R}^n$ onto the column space of the design matrix $\tilde{X}$.

In other words, the predicted values $\hat{y} = \tilde{X} \tilde{w}$ live in the subspace spanned by the feature vectors (the columns of $\tilde{X}$). Among all possible vectors in this subspace, $\hat{y}$ is the one closest to the true output $y$ in Euclidean distance.

This makes $\hat{y}$ the best linear approximation of $y$ using the available features. The residual vector $y - \hat{y}$ is orthogonal to the subspace spanned by the feature vectors.
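
This orthogonality can be checked directly: after fitting, $\tilde{X}^\top (y - \hat{y})$ should be numerically zero. Here is a small sketch on synthetic data (using np.linalg.lstsq for the fit):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + rng.normal(scale=0.3, size=50)

X_tilde = np.hstack([X, np.ones((50, 1))])
w_tilde, *_ = np.linalg.lstsq(X_tilde, y, rcond=None)
y_hat = X_tilde @ w_tilde

# The residual is orthogonal to every column of X~ (up to floating-point error)
print(X_tilde.T @ (y - y_hat))                               # entries near zero
print(np.allclose(X_tilde.T @ (y - y_hat), 0.0, atol=1e-8))  # True
```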

The Probabilistic Interpretation: Linear Regression as Maximum Likelihood

In the previous section, we viewed linear regression purely as a geometric problem: we projected the target vector $y$ onto the column space of the design matrix $\tilde{X}$. But notice what was missing: we made no assumptions about uncertainty or randomness. The model simply sought the best deterministic fit.

In this section, we take a different approach. Instead of treating $y$ as fixed, we now treat it as a random variable. Specifically, we assume that the outputs are generated from a linear model plus some random noise:

$$y^{(i)} = w^\top x^{(i)} + b + \varepsilon^{(i)}$$

We model the noise as Gaussian:

$$\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$

But why assume the noise $\varepsilon$ is Gaussian in the first place? This assumption may seem arbitrary, but it actually arises from a deep statistical principle: maximum entropy.

Among all probability distributions with a given mean and variance, the Gaussian distribution is the one with the highest entropy — meaning it makes the fewest assumptions beyond those two constraints. In other words, it's the most "uninformative" distribution consistent with knowing only the mean and variance.

This aligns with a core generalization principle in machine learning: when we model uncertainty, we should choose distributions that inject as little bias as possible beyond what the data dictates. The Gaussian is a natural choice in this regard, especially within the exponential family of distributions, which forms the foundation of many standard ML models.

The Gaussian noise assumption implies that for each data point, the target value $y^{(i)}$ is normally distributed around the linear prediction $w^\top x^{(i)} + b$ with variance $\sigma^2$:

$$p(y^{(i)} \mid x^{(i)}) = \mathcal{N}(y^{(i)} \mid w^\top x^{(i)} + b, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - w^\top x^{(i)} - b)^2}{2\sigma^2}\right)$$

The goal of maximum likelihood is to choose parameters $w$ and $b$ that maximize the probability of observing the entire dataset. Assuming the data points are independent, we can write the likelihood as:

$$\mathcal{L}(w, b \mid x^{(1)}, \dots, x^{(n)}) = \prod_{i=1}^{n} p(y^{(i)} \mid x^{(i)}; w, b)$$

Plugging in the Gaussian formula for each term:

$$\mathcal{L}(w, b) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y^{(i)} - w^\top x^{(i)} - b)^2}{2\sigma^2}\right)$$

Since a product of exponentials is the exponential of the sum of their arguments, the product simplifies to:

$$\mathcal{L}(w, b) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^n \cdot \exp\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y^{(i)} - w^\top x^{(i)} - b)^2 \right)$$

To make optimization easier, we take the logarithm of the likelihood function — known as the log-likelihood. Taking the negative of this expression transforms the maximization problem into a minimization problem:

$$-\log \mathcal{L}(w, b) = \frac{n}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y^{(i)} - w^\top x^{(i)} - b)^2$$
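
As a quick cross-check, the sketch below evaluates this closed-form expression and compares it against the definition of the likelihood as a product of per-point Gaussian densities (with $\sigma$ fixed by hand; all names are illustrative):

```python
import numpy as np

def neg_log_likelihood(w, b, sigma, X, y):
    """-log L(w, b) = n/2 * log(2*pi*sigma^2) + SSE / (2*sigma^2)."""
    sse = np.sum((y - (X @ w + b)) ** 2)
    return 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2) + sse / (2 * sigma ** 2)

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 2))
y = X @ np.array([1.0, -1.0]) + 0.5 + rng.normal(scale=0.2, size=20)

w, b, sigma = np.array([1.0, -1.0]), 0.5, 0.2

# Product-of-Gaussians definition, evaluated point by point and then log-transformed
mu = X @ w + b
densities = np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(np.isclose(neg_log_likelihood(w, b, sigma, X, y), -np.sum(np.log(densities))))  # True
```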

Notice that the first term $\frac{n}{2} \log(2\pi\sigma^2)$ is constant with respect to $w$ and $b$, so it has no effect on the optimization.

To make this explicit, let’s compute the partial derivatives with respect to the parameters:

$$\nabla_{w, b} \left[-\log \mathcal{L}(w, b)\right] = 0 + \nabla_{w, b} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y^{(i)} - w^\top x^{(i)} - b)^2 \right]$$

This shows that the constant term vanishes under differentiation, and what remains is proportional to the gradient of the squared loss function.

Therefore, maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which (dropping the constant term and the positive factor $\frac{1}{2\sigma^2}$) reduces to:

$$\sum_{i=1}^{n} (y^{(i)} - w^\top x^{(i)} - b)^2$$

This is exactly the same loss we derived from a geometric perspective — the squared error — now obtained from a probabilistic foundation.

$$\min_{w, b} \sum_{i=1}^{n} (y^{(i)} - w^\top x^{(i)} - b)^2$$

In other words, least squares estimation is equivalent to maximum likelihood estimation under a Gaussian noise model. What was previously a purely geometric objective now emerges from a probabilistic framework.
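
To see this equivalence in code, the sketch below minimizes the negative log-likelihood numerically (using scipy.optimize.minimize with $\sigma$ held fixed) and compares the result with the closed-form least-squares solution; the two agree up to optimizer tolerance. This is only an illustration of the equivalence, not how you would fit the model in practice.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0 + rng.normal(scale=0.5, size=200)
X_tilde = np.hstack([X, np.ones((200, 1))])

# Closed-form least squares via the normal equation
w_ls = np.linalg.solve(X_tilde.T @ X_tilde, X_tilde.T @ y)

# Maximum likelihood: minimize the negative log-likelihood with sigma fixed
sigma = 0.5
def nll(w_tilde):
    sse = np.sum((y - X_tilde @ w_tilde) ** 2)
    return 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2) + sse / (2 * sigma ** 2)

w_mle = minimize(nll, x0=np.zeros(4), method="BFGS").x

print(np.allclose(w_ls, w_mle, atol=1e-4))  # True: same estimates from both routes
```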

This new perspective allows us to reason probabilistically about our model's uncertainty and make confidence estimates.

Conclusion

We’ve now seen linear regression from two fundamental perspectives:

  • The geometric view: linear regression projects the target $y$ onto the column space of $\tilde{X}$, finding the best linear approximation in Euclidean space.
  • The probabilistic view: assuming Gaussian noise and applying maximum likelihood leads to the same squared error loss — but this time, derived from a principled statistical model.

This duality is foundational to how we design and understand learning algorithms. Whether we interpret linear regression as a projection or as a likelihood optimizer, we’re using the same underlying mathematics — but with different implications and extensions.


📌 This post was inspired by a lecture and exam question from the Machine Learning (CS376) course at KAIST, taught by Professor Noseong Park.