Linear Regression
What is linear regression?
- A learning algorithm
- A prediction algorithm
- An optimization algorithm
- A machine learning algorithm
\(y = m \times x + c\)
| y | x | m | c |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 |
| 2 | 2 | 1 | 0 |
- \(y = m \times x + c\): Model
- \(x\): Input / Independent variable
- \(y\): Prediction / Dependent variable
- \(m, c\): Model parameters
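A minimal sketch of this model as code (the function name predict is just illustrative), using \(m = 1\) and \(c = 0\) from the table above:

```python
# Sketch of the model y = m*x + c with the parameters from the table above.
def predict(x, m=1.0, c=0.0):
    return m * x + c

for x in [0, 1, 2]:
    print(x, predict(x))   # 0 -> 0.0, 1 -> 1.0, 2 -> 2.0, matching the table
```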
What is the equation?
- we don't have one yet
How to frame it?
- Given a set of corresponding inputs and outputs,
- Which line is 'best'?
How to define 'best'? Let's try.
By Inspection
Case 1: One point
| y | x |
|---|---|
| 0 | 0 |
- \(m =\) ?
- \(c =\) ?
Case 2: Two points
| y | x |
|---|---|
| 0 | 0 |
| 1 | 1 |
- \(m =\) ?
- \(c =\) ?
Case 3: Two points
| y | x |
|---|---|
| 0 | 0 |
| 1 | -1 |
- \(m =\) ?
- \(c =\) ?
Case 4: Three points
| y | x |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
- \(m =\) ?
- \(c =\) ?
Case 5: Three points
| y | x |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
- \(m =\) ?
- \(c =\) ?
Which is the best line?
Let's draw a new table, with \(\mbox{err} = |\mbox{pred} - y|\) for each row:
\(m = 1; c = 0\)
| y | x | err |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 1 | 1 |
\(m = 0.5; c = 0\)
| y | x | err |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | 0.5 |
| 2 | 1 | 1.5 |
- So now which is the best line?
- Why?
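A sketch of how tables like these can be computed, assuming the per-row error is the absolute difference between prediction and expected value:

```python
# Per-row and total error for a candidate line (m, c), assuming
# err = |pred - y|. Data are the three points from Case 5.
data = [(0, 0), (1, 1), (1, 2)]   # (x, y) pairs

def errors(m, c):
    per_row = [abs(m * x + c - y) for x, y in data]
    return per_row, sum(per_row)

print(errors(m=1.0, c=0.0))   # ([0.0, 0.0, 1.0], 1.0)
print(errors(m=0.5, c=0.0))   # ([0.0, 0.5, 1.5], 2.0)
```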
The Problem
Which is the best line?
- That which minimizes the sum of error over each sample/example/datapoint/row
\(\mbox{Err}(\mbox{data} \, ; \, \mbox{model}) = \sum_i \mbox{err}(\mbox{row}_i \, ; \, \mbox{model})\)
- What is \(\mbox{err}\)?
\(\mbox{err}(\mbox{row} \, ; \, \mbox{model}) = \mbox{pred} - \mbox{expected}\)
i.e. we need both \(\mbox{row}\) and \(\mbox{expected}\), its corresponding label/supervision/annotation/ground-truth (a sketch in code follows this list)
- Now, in the context of linear regression,
- What is \(\mbox{row}\)?
- What is \(\mbox{expected}\)?
Silly question:
- Why do we sum over data and not models?
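As promised above, a sketch of the total-error definition in code; the model is just the pair \((m, c)\), and the per-row error function is left pluggable:

```python
# Err(data; model) = sum_i err(row_i; model), with model = (m, c).
def total_error(data, model, err_fn):
    m, c = model
    return sum(err_fn(m * x + c, expected) for x, expected in data)

def signed(pred, expected):
    return pred - expected

def absolute(pred, expected):
    return abs(pred - expected)

data = [(0, 0), (1, 1), (1, 2)]                  # (row, expected) pairs
print(total_error(data, (1.0, 0.0), signed))     # -1.0
print(total_error(data, (1.0, 0.0), absolute))   #  1.0
```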
The Usual Scenario
- What if error were simply the signed difference, \(\mbox{pred} - \mbox{expected}\)?
\(m = 1.5; c = 0\)
| y | x | err |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | +0.5 |
| 2 | 1 | -0.5 |
This gives a total error of 0 even though the line does not fit the data perfectly, so...
- We usually use error = mean squared error
- Why not just the absolute value? The reasons are neither simple nor universally applicable; each task must justify using MSE, or use other relevant metrics.
Is it possible to use other errors? Yes, for example (sketched in code after this list):
- RMSE : Root mean squared error
- Manhattan distance : Edges on grid
- Levenshtein distance : Symbols
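To make these options concrete, here is a small sketch comparing a few of them on the same three points, using the line \(m = 1.5; c = 0\) from the table above:

```python
import numpy as np

# Compare error definitions for the line m = 1.5, c = 0 on the three points.
x = np.array([0.0, 1.0, 1.0])
y = np.array([0.0, 1.0, 2.0])
residual = 1.5 * x + 0.0 - y             # pred - expected: [0.0, 0.5, -0.5]

print(residual.sum())                    # 0.0      -> signed errors cancel
print(np.abs(residual).mean())           # 0.333... -> mean absolute error
print((residual ** 2).mean())            # 0.166... -> mean squared error (MSE)
print(np.sqrt((residual ** 2).mean()))   # 0.408... -> RMSE
```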
What does this affect?
- It affects whether a solution can be found!
- We can set things like tolerance and margin
- We can check which samples are 'farthest' from the model (sketched below)
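For instance, a sketch of that last point: the sample 'farthest' from a given line is simply the one with the largest residual (data and names are illustrative).

```python
import numpy as np

# Find the sample farthest from the line m = 1, c = 0, measured by |pred - y|.
x = np.array([0.0, 1.0, 1.0])
y = np.array([0.0, 1.0, 2.0])
residuals = np.abs(1.0 * x + 0.0 - y)          # [0.0, 0.0, 1.0]

print(np.argmax(residuals), residuals.max())   # index 2, distance 1.0
```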
The Actual Algorithm
Paraphrased from the scipy source for scipy.stats.linregress:
import numpy as np

def linregress(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size == 0 or y.size == 0:
        raise ValueError("Inputs must not be empty.")
    if np.amax(x) == np.amin(x) and len(x) > 1:
        raise ValueError("Cannot calculate a linear regression "
                         "if all x values are identical")
    xmean = np.mean(x, None)
    ymean = np.mean(y, None)
    # Average sums of square differences from the mean
    #   ssxm  = mean( (x-mean(x))^2 )
    #   ssxym = mean( (x-mean(x)) * (y-mean(y)) )
    ssxm, ssxym, _, _ = np.cov(x, y, bias=1).flat
    slope = ssxym / ssxm
    intercept = ymean - slope * xmean
    return slope, intercept
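A quick check of the function above on the three points used throughout (a sketch; scipy.stats.linregress would return the same slope and intercept on this input):

```python
# The least-squares line through (0, 0), (1, 1), (1, 2).
slope, intercept = linregress([0.0, 1.0, 1.0], [0.0, 1.0, 2.0])
print(round(slope, 6), round(intercept, 6))   # 1.5 0.0  ->  y = 1.5x
```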
One can write an entire book about covariance matrices, so this is a nuanced topic.
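Still, the two assignment lines above fall out of a short calculation: treat the mean squared error as a function of \(m\) and \(c\), set its partial derivatives to zero, and solve. A compact sketch:

\[
\frac{\partial}{\partial c} \, \frac{1}{n} \sum_i (m x_i + c - y_i)^2 = 0 \;\Rightarrow\; c = \bar{y} - m \bar{x}
\]

\[
\frac{\partial}{\partial m} \, \frac{1}{n} \sum_i (m x_i + c - y_i)^2 = 0 \;\Rightarrow\; m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\mbox{ssxym}}{\mbox{ssxm}}
\]

which are exactly intercept = ymean - slope * xmean and slope = ssxym / ssxm in the code.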
How does the Algorithm work?
For a complete understanding of this section, the prerequisites are
- intuitions about signal and noise,
- intuitions about a generating process.
Ideas for attack
- How can data leak: as an entry point to independence
- How to measure correlation: as an entry point to mean and standard deviation
- Signal and noise: eigenvectors
- Actual math derivation of covariance as the solution: we still need to discover our threads here