Linear Regression
What is linear regression?
- A learning algorithm
- A prediction algorithm
- An optimization algorithm
- A machine learning algorithm
\(y = m \times x + c\)
| y | x | m | c |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 1 | 1 | 0 |
| 2 | 2 | 1 | 0 |
- \(y = m \times x + c\): Model
- \(x\): Input / Independent variable
- \(y\): Prediction / Dependent variable
- \(m, c\): Model parameters
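A minimal sketch of this model as code (the function name predict is just illustrative), using \(m = 1\) and \(c = 0\) from the table above:

```python
# Sketch of the model y = m*x + c with the parameters from the table above.
def predict(x, m=1.0, c=0.0):
    return m * x + c

for x in [0, 1, 2]:
    print(x, predict(x))   # 0 -> 0.0, 1 -> 1.0, 2 -> 2.0, matching the table
```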
What is the equation?
- we don't have one yet
How to frame it?
- Given a set of corresponding inputs and outputs,
- Which line is 'best'?
How to define 'best'? Let's try.
By Inspection
Case 1: One point
| y | x |
|---|---|
| 0 | 0 |
- \(m =\) ?
- \(c =\) ?
Case 2: Two points
| y | x |
|---|---|
| 0 | 0 |
| 1 | 1 |
- \(m =\) ?
- \(c =\) ?
Case 3: Two points
| y | x |
|---|---|
| 0 | 0 |
| 1 | -1 |
- \(m =\) ?
- \(c =\) ?
Case 4: Three points
| y | x |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 2 |
- \(m =\) ?
- \(c =\) ?
Case 5: Three points
| y | x |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 1 |
- \(m =\) ?
- \(c =\) ?
Which is the best line?
Let's draw a new table, with \(\mbox{err} = |\mbox{pred} - y|\) for each row:
\(m = 1; c = 0\)
| y | x | err |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 2 | 1 | 1 |
\(m = 0.5; c = 0\)
| y | x | err |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | 0.5 |
| 2 | 1 | 1.5 |
- So now which is the best line?
- Why?
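A sketch of how tables like these can be computed, assuming the per-row error is the absolute difference between prediction and expected value:

```python
# Per-row and total error for a candidate line (m, c), assuming
# err = |pred - y|. Data are the three points from Case 5.
data = [(0, 0), (1, 1), (1, 2)]   # (x, y) pairs

def errors(m, c):
    per_row = [abs(m * x + c - y) for x, y in data]
    return per_row, sum(per_row)

print(errors(m=1.0, c=0.0))   # ([0.0, 0.0, 1.0], 1.0)
print(errors(m=0.5, c=0.0))   # ([0.0, 0.5, 1.5], 2.0)
```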
The Problem
Which is the best line?
- That which minimizes the sum of error over each sample/example/datapoint/row
\(\mbox{Err}(\mbox{data} \, ; \, \mbox{model}) = \sum_i \mbox{err}(\mbox{row}_i \, ; \, \mbox{model})\)
- What is \(\mbox{err}\)?
\(\mbox{err}(\mbox{row} \, ; \, \mbox{model}) = \mbox{pred} - \mbox{expected}\)
i.e. we need both \(\mbox{row}\) and \(\mbox{expected}\), its corresponding label/supervision/annotation/ground-truth (a sketch in code follows this list)
- Now, in the context of linear regression,
- What is \(\mbox{row}\)?
- What is \(\mbox{expected}\)?
Silly question:
- Why do we sum over data and not models?
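As promised above, a sketch of the total-error definition in code; the model is just the pair \((m, c)\), and the per-row error function is left pluggable:

```python
# Err(data; model) = sum_i err(row_i; model), with model = (m, c).
def total_error(data, model, err_fn):
    m, c = model
    return sum(err_fn(m * x + c, expected) for x, expected in data)

def signed(pred, expected):
    return pred - expected

def absolute(pred, expected):
    return abs(pred - expected)

data = [(0, 0), (1, 1), (1, 2)]                  # (row, expected) pairs
print(total_error(data, (1.0, 0.0), signed))     # -1.0
print(total_error(data, (1.0, 0.0), absolute))   #  1.0
```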
The Usual Scenario
- What if error were simply the signed difference, \(\mbox{pred} - \mbox{expected}\)?
\(m = 1.5; c = 0\)
| y | x | err |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | +0.5 |
| 2 | 1 | -0.5 |
This gives a total error of 0 even though the line does not fit the data perfectly, so...
- We usually use error = mean squared error
- Why not just the absolute value? The reasons are neither simple nor universally applicable; each task must justify using MSE, or use other relevant metrics.
Is it possible to use other errors? Yes, for example (sketched in code after this list):
- RMSE : Root mean squared error
- Manhattan distance : Edges on grid
- Levenshtein distance : Symbols
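To make these options concrete, here is a small sketch comparing a few of them on the same three points, using the line \(m = 1.5; c = 0\) from the table above:

```python
import numpy as np

# Compare error definitions for the line m = 1.5, c = 0 on the three points.
x = np.array([0.0, 1.0, 1.0])
y = np.array([0.0, 1.0, 2.0])
residual = 1.5 * x + 0.0 - y             # pred - expected: [0.0, 0.5, -0.5]

print(residual.sum())                    # 0.0      -> signed errors cancel
print(np.abs(residual).mean())           # 0.333... -> mean absolute error
print((residual ** 2).mean())            # 0.166... -> mean squared error (MSE)
print(np.sqrt((residual ** 2).mean()))   # 0.408... -> RMSE
```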
What does this affect?
- It affects whether a solution can be found!
- We can set things like tolerance and margin
- We can check which samples are 'farthest' from the model (sketched below)
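For instance, a sketch of that last point: the sample 'farthest' from a given line is simply the one with the largest residual (data and names are illustrative).

```python
import numpy as np

# Find the sample farthest from the line m = 1, c = 0, measured by |pred - y|.
x = np.array([0.0, 1.0, 1.0])
y = np.array([0.0, 1.0, 2.0])
residuals = np.abs(1.0 * x + 0.0 - y)          # [0.0, 0.0, 1.0]

print(np.argmax(residuals), residuals.max())   # index 2, distance 1.0
```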
The Actual Algorithm
Paraphrased from the scipy source for scipy.stats.linregress:
import numpy as np

def linregress(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.size == 0 or y.size == 0:
        raise ValueError("Inputs must not be empty.")
    if np.amax(x) == np.amin(x) and len(x) > 1:
        raise ValueError("Cannot calculate a linear regression "
                         "if all x values are identical")
    xmean = np.mean(x, None)
    ymean = np.mean(y, None)
    # Average sums of square differences from the mean
    #   ssxm  = mean( (x-mean(x))^2 )
    #   ssxym = mean( (x-mean(x)) * (y-mean(y)) )
    ssxm, ssxym, _, _ = np.cov(x, y, bias=1).flat
    slope = ssxym / ssxm
    intercept = ymean - slope * xmean
    return slope, intercept
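A quick check of the function above on the three points used throughout (a sketch; scipy.stats.linregress would return the same slope and intercept on this input):

```python
# The least-squares line through (0, 0), (1, 1), (1, 2).
slope, intercept = linregress([0.0, 1.0, 1.0], [0.0, 1.0, 2.0])
print(round(slope, 6), round(intercept, 6))   # 1.5 0.0  ->  y = 1.5x
```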
One can write an entire book about covariance matrices, so this is a nuanced topic.
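Still, the two assignment lines above fall out of a short calculation: treat the mean squared error as a function of \(m\) and \(c\), set its partial derivatives to zero, and solve. A compact sketch:

\[
\frac{\partial}{\partial c} \, \frac{1}{n} \sum_i (m x_i + c - y_i)^2 = 0 \;\Rightarrow\; c = \bar{y} - m \bar{x}
\]

\[
\frac{\partial}{\partial m} \, \frac{1}{n} \sum_i (m x_i + c - y_i)^2 = 0 \;\Rightarrow\; m = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\mbox{ssxym}}{\mbox{ssxm}}
\]

which are exactly intercept = ymean - slope * xmean and slope = ssxym / ssxm in the code.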
How does the Algorithm work?
For a complete understanding of this section, the prerequisites are
- intuitions about signal and noise,
- intuitions about a generating process.
Ideas for attack
- How can data leak: as an entry point to independence
- How to measure correlation: as an entry point to mean and standard deviation
- Signal and noise: eigenvectors
- Actual math derivation of covariance as the solution: we still need to discover our threads here