Regressions: The Math Behind the Magic

When I was in Algebra 1, our teacher introduced us to the concept of linear regressions. We could go into Desmos, put random data points in a table, and watch the computer magically spit out the line of best fit with a push of a button. Not many of us stopped to think, “How the heck is it doing that?” We just took it for granted, because, hey, that wasn’t our problem.
Looking back at it, I still think there is some magic behind linear regressions. I mean, how many of you have sat down with a paper, a pencil, some random points, and the desire to find the best-fitting line? And how many of you guessed that this is exactly what I’m going to do in this article?

What is a regression?

For those who need a refresher, regression models take a sample of data, such as points on a graph, and try to find a relationship between the independent parameters (x) and the output (g(x)). In my algebra class, we had points on a graph, and Desmos gave us a linear association between x and g(x) in the form of some function f(x), similar to the graphic below.

Linear Regressions

So, why don't we sit down and work out an example of linear regression? Linear fits are one of the simplest and most frequently used regression models. Let our points be \(p = \{(0,3), (2, 5), (4, 9), (6,10), (11, 20)\}\). We can define an output function \(g(x)\) that matches the data: \(g(0)= 3, g(2) = 5\), and so on. Let's call the relation function \(f(x)\). It would seem logical to try to minimize some error function between the two functions, giving us the optimal line. However, then we run into a problem.

What should the error function be?


Method of Least Squares

A natural choice is the sum of the squares of the residuals:

\[\text{Error Function} = \sum_{i=1}^{n} (f(x_i)-g(x_i))^2\]

The terms \(r_i= f(x_i) - g(x_i)\) are the residuals of our prediction, the difference between the prediction and the actual observed values. We square them so negatives don’t cancel out, and bigger mistakes get more weight.
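To make this concrete, here is a minimal Python sketch (my own illustration, not anything Desmos does internally) that computes this error for a candidate line on the points from our example:

```python
# Sum of squared residuals for a candidate line f(x) = a*x + b,
# evaluated on the sample points from the example.
points = [(0, 3), (2, 5), (4, 9), (6, 10), (11, 20)]

def sum_of_squares(a, b):
    """Return the sum of squared residuals for the line f(x) = a*x + b."""
    return sum((a * x + b - y) ** 2 for x, y in points)

print(sum_of_squares(1.0, 1.0))    # a poor guess: error ~97
print(sum_of_squares(1.54, 2.31))  # near the best-fit line: error ~3.9
```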

Minimizing this error function by varying f(x) gives us what is known in the biz as the Best Linear Unbiased Estimator (BLUE); under some standard assumptions, this is exactly the Gauss–Markov theorem. To see what this means, let’s take it apart, piece by piece.

Estimation: The function f(x) is an estimation or prediction model.

Unbiased: When we minimize the error function, the residuals average to 0. Think of darts on a target:

Source: Noise: A Flaw in Human Judgment by Cass R. Sunstein, Daniel Kahneman, and Olivier Sibony

Imagine these two targets represent our residuals. Notice how team A’s shots are centered around the bullseye, while Team B’s are shifted down and to the left. Although both are clustered tightly, Team A’s shots are considered unbiased, because they roughly fall around the center of the target.

Linear: The function is a linear combination of the input(s). (Yes, you can also do a regression using multiple input parameters. We will talk about this later.)

Best: Among all the unbiased models, the one we want is the one with the smallest variance. To understand why we want this, let's go back to the targets.

Source: Noise: A Flaw in Human Judgment by Cass R. Sunstein, Daniel Kahneman, and Olivier Sibony

Again, pretend these two targets are a metaphor for the residuals. Notice how both of them are roughly centered around the origin, so both are unbiased. However, Team A’s shots are more tightly clustered, so there is less variance. This gives us more confidence in their accuracy, compared to Team C.

We will revisit this idea of Best later. An intuitive proof requires a matrix representation, which is easier to approach with an example.

The main idea is that we want our errors to cluster tightly around zero, like Team A’s shots clustering around the bullseye.


Finishing the Example

Let's apply the method of least squares to our example. Since \(f(x)\) is a linear function, we can write it as \(f(x)=ax+b\).

The residuals are then: \(r_i= ax_i+b-g(x_i)\).

So the error function becomes: \(\sum_{i=1}^{n} r^2_i = \sum_{i=1}^{n} (ax_i+b-g(x_i))^2= (0a+b-3)^2+(2a+b-5)^2+ \dots +(11a+b-20)^2\)

Expanding and simplifying, we end up with a big quadratic expression in a and b:

\(\sum_{i=1}^{n} r^2_i = 177a^2 + 46ab + 5b^2 - 652a - 94b + 615\).
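If you'd rather not expand that by hand, here is a quick symbolic check (using sympy, which is my choice of tool, not something the derivation itself needs):

```python
import sympy as sp

a, b = sp.symbols('a b')
points = [(0, 3), (2, 5), (4, 9), (6, 10), (11, 20)]

# Expand the sum of squared residuals symbolically.
error = sp.expand(sum((a * x + b - y) ** 2 for x, y in points))
print(error)  # matches 177a^2 + 46ab + 5b^2 - 652a - 94b + 615 (term order may differ)
```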

This error function traces out a surface that looks a lot like a three-dimensional parabola (a paraboloid). This shouldn’t be too surprising, given that our function contains quadratic terms in a and b.


To find the minimum point, we can set the partial derivatives of the function to 0 and solve for \(a\) and \(b\).
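Written out for our quadratic, those two conditions are:

\[\frac{\partial}{\partial a}\sum_{i=1}^{n} r_i^2 = 354a + 46b - 652 = 0, \qquad \frac{\partial}{\partial b}\sum_{i=1}^{n} r_i^2 = 46a + 10b - 94 = 0\]

which is just a two-equation linear system in \(a\) and \(b\).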

Solving it, the answer comes out to be \((a, b) = \left(\frac{549}{356},\frac{821}{356}\right)\). Not the prettiest numbers, but it works as an example.

Looking back at what this means, our desired linear regression is \(y=\frac{549}{356}x+\frac{821}{356}\).

The plot below shows our original data set, along with the linear fit we came up with.

A pretty good fit indeed!
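If you'd like the computer to double-check the algebra, a minimal numpy sketch (my tool choice; the article itself only needs pencil and paper) reproduces the same slope and intercept:

```python
import numpy as np

# The sample points from the example.
x = np.array([0, 2, 4, 6, 11], dtype=float)
y = np.array([3, 5, 9, 10, 20], dtype=float)

# Fit a degree-1 polynomial (a line) by least squares.
slope, intercept = np.polyfit(x, y, 1)

print(slope, 549 / 356)      # both ~1.5421
print(intercept, 821 / 356)  # both ~2.3062
```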

Expanding on the Method of Least Squares

So why does the method of least squares give us the Best Linear Unbiased Estimator (BLUE)?

Let \(\mu\) be the mean of a data set \(x_1, x_2, \dots, x_n\). Then its variance is defined as:

\[\text{var}(x)=\frac{\sum_{i=1}^{n}(x_i-\mu)^2}{n}\]

For our residuals, \(\mu = 0\), because they are unbiased. Since \(n\) is a constant, minimizing the variance of the residuals is the same as minimizing

\[\sum_{i=1}^{n}e_i^2=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2\]

which is the same as the method of least squares, so our procedure minimizes variance.

If this seems abstract, let's try a geometric view. To do this, we introduce matrix notation.

We have an observed output vector (our data), \(\textbf{y}\), and we assume that it is a linear function of an input matrix, \(\textbf{X}\), offset by some errors. In matrix form, the system can be written as \(\textbf{y} = \textbf{X}\beta+\varepsilon\), where \(\beta\) is the vector of coefficients we are trying to find. Our prediction model will be \(\hat{y}=\textbf{X}\beta\). This lets us write the whole data set compactly in one equation. In the last example:

\[\begin{bmatrix}3\\5\\9\\10\\20\end{bmatrix} = \begin{bmatrix}1 & 0\\1&2\\1&4\\1&6\\1&11\end{bmatrix} \begin{bmatrix}\beta_1\\\beta_2\end{bmatrix} + \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\varepsilon_3\\\varepsilon_4\\\varepsilon_5\end{bmatrix} \text{ and } \beta=\begin{bmatrix}\frac{821}{356}\\ \frac{549}{356}\end{bmatrix}\]


This allows for a geometric view using vector span.

In the diagram to the right (Image Source: Wikipedia), the green area represents the column space of \(\textbf{X}\), the span of the columns of \(\textbf{X}\). Simply put, it is the space where all possible prediction vectors \(\hat{y}=\textbf{X}\beta\) live. For the sake of visualization, the diagram shows it as a 2D plane in a 3D space. Since \(y\) usually does not lie in the column space, any \(\hat{y}\) will be slightly off from the actual \(y\). The residual vector is \(\varepsilon = y - \hat{y} = y - \textbf{X}\beta\). The goal is to minimize this vector’s length, \(||\varepsilon||\). Since the length is nonnegative, we can just as well minimize \(||\varepsilon||^2\), which avoids square roots. Also, notice that this is the same thing as our sum-of-squares method, since \(||\varepsilon||^2 = \varepsilon^2_1+\varepsilon^2_2+\dots+\varepsilon^2_n = \sum_{i=1}^{n} (y_i-\hat{y}_i)^2\).

Thus, by projecting \(y\) onto the space of possible predictions, we’re guaranteed the smallest possible sum of squared errors, and hence the smallest residual variance. This is why least squares satisfies the “Best” part of BLUE.
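In practice, this projection is computed by solving the normal equations \(\mathbf{X}^T\mathbf{X}\beta = \mathbf{X}^T\mathbf{y}\) (a standard result I'm pulling in here, not something derived above). A short sketch using the example's matrices:

```python
import numpy as np

# Design matrix X and observation vector y from the example above.
X = np.array([[1, 0], [1, 2], [1, 4], [1, 6], [1, 11]], dtype=float)
y = np.array([3, 5, 9, 10, 20], dtype=float)

# Solve the normal equations X^T X beta = X^T y,
# which is exactly the projection of y onto the column space of X.
beta = np.linalg.solve(X.T @ X, X.T @ y)

print(beta)                  # ~[2.3062, 1.5421], i.e. [intercept, slope]
print(821 / 356, 549 / 356)  # the exact fractions from earlier
```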


Adapting the Method: Beyond Linear Regressions

So what if a line doesn't fit the data? What if I need a quadratic, or an exponential, or something even crazier?

Let’s start with polynomials. Notice that the model can take multiple inputs. If we let those inputs be powers of \(x_i\), then the regression describes a polynomial. In other words, for an nth-degree polynomial regression with \(m\) observations, if we let

\[\mathbf{X}= \begin{bmatrix} 1 & x_1 & x_1^2 & x_1^3 & \ldots & x_1^n \\ 1 & x_2 & x_2^2 & x_2^3 & \ldots & x_2^n \\ 1 & x_3 & x_3^2 & x_3^3 & \ldots & x_3^n \\ 1 & x_4 & x_4^2 & x_4^3 & \ldots & x_4^n \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_m & x_m^2 & x_m^3 & \ldots & x_m^n \end{bmatrix} \text{, then } \mathbf{x}_k^T\beta= \begin{bmatrix} 1 & x_k & x_k^2 & x_k^3 & \cdots & x_k^n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \beta_3 \\ \vdots \\ \beta_n \end{bmatrix}=\beta_0+\beta_1x_k+\beta_2x_k^2+\beta_3x_k^3+\cdots+\beta_nx_k^n\]

where \(\mathbf{x}^T_k\) is the kth row of \(\textbf{X}\). So, we can model any nth-degree polynomial using this method!
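As a small sketch of this (with made-up points of my own that follow a rough quadratic trend), the same least-squares machinery handles the bigger design matrix:

```python
import numpy as np

# Made-up sample points that follow a rough quadratic trend.
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([1.2, 2.1, 5.3, 9.8, 17.1, 26.2], dtype=float)

# Build the polynomial design matrix with columns [1, x, x^2].
X = np.column_stack([np.ones_like(x), x, x**2])

# Same normal equations as before, just with more columns.
beta = np.linalg.solve(X.T @ X, X.T @ y)
print(beta)  # [beta_0, beta_1, beta_2] for beta_0 + beta_1*x + beta_2*x^2
```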

What about exponentials? An exponential model like \(y = ab^x\) isn't linear in its parameters, so we can’t directly apply our standard linear least squares method. So, is that it? Is our method limited to polynomials and nothing more?


Nonlinear Regressions

The general nonlinear regression doesn’t have a closed form for the best parameters.

However, this doesn’t mean we can’t solve them. For example, take our exponential regression of the form \(y= ab^xU\), where \(U\) is the error term. We can take the logarithm of both sides to get \(\ln(y)=\ln(a)+x \ln(b)+\ln(U)\). If we let \(x'=x,\ y'=\ln(y),\ m =\ln(b),\ c=\ln(a),\ \varepsilon=\ln(U)\), our equation becomes \(y'=mx'+c+\varepsilon\). Yay! Now we have a linear regression that we know how to solve! We can use the above method to solve for m and c, and exponentiate them to get a and b (remember, \(b=e^m, a=e^c\)).
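Here is a minimal sketch of that trick in code (the data points are made up for illustration; they roughly follow \(y = 2 \cdot 2^x\)):

```python
import numpy as np

# Made-up data that grows roughly like y = 2 * 2**x.
x = np.array([0, 1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 4.2, 7.9, 16.5, 31.0, 65.2], dtype=float)

# Fit a line to (x, ln y): ln(y) = ln(a) + x*ln(b).
m, c = np.polyfit(x, np.log(y), 1)

# Exponentiate to recover the original parameters.
a, b = np.exp(c), np.exp(m)
print(a, b)  # y ~ a * b**x, with b close to 2 for this data
```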

Ingenious, right? Except there’s a small problem that you may have noticed. In our exponential regression, the error term is multiplicative, not additive. Does this mean the Method of Least Squares doesn’t apply anymore? Not quite: the least-squares fit on the log scale is still fine, but transforming the predictions back to the original scale introduces bias, which can be approximately corrected with a smearing estimator. I won’t go into the details here, but for those interested in learning more, I have included links at the end.

What about something harder, like \(y \sim ar^x+bs^x\)? Or \(y \sim ax^b\) (notice how this differs from an exponential)? Or \(y \sim ax^{2.718}+\sin(bx)+\log_c(x)\)!? As I said before, the general nonlinear regression doesn't have an exact solution. However, we can use gradient descent to fine-tune the parameters to the desired level of precision. We start with random parameters. Then we take the gradient of our error function \(\sum_{i=1}^{n} (y_i-f(x_i;\beta))^2\) with respect to the parameters \(\beta_1, \beta_2, \dots ,\beta_n\). The gradient points in the direction of steepest ascent, so we nudge the parameters the opposite way, in the direction of steepest descent. With each iteration, the parameters become more accurate, and we can repeat this process to the desired degree of precision. To see this method in action interactively, scroll down to the gradient descent simulation below; a bare-bones version in code follows here.
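This is a stripped-down sketch of that loop for a model of the form \(y \approx ax^b\), with toy data I generated from known parameters and a hand-picked, fixed learning rate (a real optimizer would be more careful about step sizes):

```python
import numpy as np

# Toy data generated from y = 2 * x**1.5, so we know where the minimum should be.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x**1.5

a, b = 1.0, 1.0   # starting guesses for the parameters
lr = 1e-4         # learning rate (how far to step each iteration)

for _ in range(50_000):
    resid = y - a * x**b
    # Gradient of the sum of squared residuals with respect to a and b.
    grad_a = -2.0 * np.sum(resid * x**b)
    grad_b = -2.0 * np.sum(resid * a * x**b * np.log(x))
    # Step against the gradient, i.e. in the direction of steepest descent.
    a -= lr * grad_a
    b -= lr * grad_b

print(a, b)  # should drift toward a ~ 2 and b ~ 1.5
```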


How well are we doing?

So, I've introduced all these ways to find different functions that can match the sample. But which one should you use for a particular data set? It goes without saying that you shouldn't try to fit a quadratic relationship to a data set that looks logarithmic. But how can we quantify how well a regression fits the data? Well, we minimize the variance of the residuals, so maybe that could be a good measure of the quality of the fit?

Yes, though it requires a minor tweak. For a data set that starts out with a lot of variation, the variation of our residuals will also tend to be high, no matter what kind of regression you use. This can be fixed by dividing out the starting variation, giving us \[\frac{\text{var}(y-\hat{y})}{\text{var}(y)} = \frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2}\]

One last touch. Better fits mean lower residual variation, so our expression goes toward 0. To flip this, we just subtract it from 1, so numbers near 1 indicate good association, while numbers near 0 indicate bad association. In the end, \[R^2 = 1 - \frac{\sum (y_i-\hat{y}_i)^2}{\sum (y_i-\bar{y})^2}\] giving us the well-known Coefficient of Determination. Although it's written as \(R^2\), it can be negative if the prediction is bad enough that the model adds more variance to the system instead of reducing it.

In linear regressions with only one input parameter, it is also useful to know whether the association is positive or negative. For this we have \(r\), which is just the square root of the Coefficient of Determination, taken as negative or positive depending on whether the slope of the line is negative or positive.

Why they didn't just call the coefficient of determination \(r^2\), I don't know. But these two indicators can tell you a lot about the regression itself. For our example at the start, these coefficients are \(R^2\approx0.97764\), and \(r\approx0.98875\), indicating a strong positive relationship between x and y, so a linear regression is probably the best regression for the job.
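Those two numbers are easy to verify with a few lines of numpy (again, my tool choice):

```python
import numpy as np

x = np.array([0, 2, 4, 6, 11], dtype=float)
y = np.array([3, 5, 9, 10, 20], dtype=float)

# Predictions from the fitted line y = (549/356) x + 821/356.
y_hat = (549 / 356) * x + 821 / 356

# Coefficient of determination: 1 - SS_res / SS_tot.
r_squared = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r = np.sqrt(r_squared)  # positive because the slope is positive

print(r_squared, r)  # ~0.97764 and ~0.98875
```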


Gradient Descent Simulation

Below is a simple simulation of how gradient descent works. The model starts out with random guesses for the parameters, leading to really bad fits, which can be seen in the \(R^2\) value. Pressing the "+Step" button moves the point in the direction of steepest descent to minimize the error function. The error function itself is shown as a 3D surface. I would recommend using "+Step" for the first few iterations, so you can see how the model starts moving in the general direction of the minimum. After that, you can use "+40 Steps" to watch it home in on the minimum. Refresh the page to try different types of regressions, e.g. logarithmic, exponential, powers, etc.

Summary

So, now to answer the age-old question in math:
When will I ever use this?
Anytime you have data. Regressions are used to predict housing prices, diagnose patients based on symptoms, understand climate change, and even train artificial intelligence. If you want to truly understand your data, regression is the key to unlocking its correlations and extracting meaningful insights.


Regressions are powerful tools in applied math. Built from the fundamentals of statistics and refined with calculus, they are used in a wide range of subjects. They have been used for centuries, and, despite their limitations, provide models that underpin many modern advancements. In fact, the field of regression analysis is still developing new methods to overcome its limitations.
But, for now, I think I’ll stick with drawing lines through points.

Resources

1. Simplest description of smearing estimators I could find: Smearing Estimators

2. Wikipedia's Series on Regression Analysis: Regression Analysis

3. An actual formula for calculating linear regressions: An Exact Formula for Linear Regressions

4. Desmos: 2D Graphing Calculator

5. Desmos 3D: 3D Graphing Calculator