
Linear Regression

What is linear regression?

  1. A learning algorithm
  2. A prediction algorithm
  3. An optimization algorithm
  4. A machine learning algorithm

\(y = m \times x + c\)

y x m c
0 0 1 0
1 1 1 0
2 2 1 0
  • \(y = m \times x + c\): Model
  • \(x\): Input / Independent variable
  • \(y\): Prediction / Dependent variable
  • \(m, c\): Model parameters
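
A minimal sketch of the model as code (the function name predict and its defaults are illustrative, not part of the original notes); with \(m = 1\) and \(c = 0\) it reproduces the small table above:

def predict(x, m=1.0, c=0.0):
    # The model: y = m*x + c
    return m * x + c

for x in [0, 1, 2]:
    print(x, predict(x))   # prints the (x, y) pairs 0 0.0, 1 1.0, 2 2.0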

What is the equation?

  • we don't have one yet

How to frame it?

  • Given a set of corresponding inputs and outputs,
  • Which line is 'best'?

How to define 'best'? Let's try.

By Inspection

Case 1: One point

y x
0 0
  • \(m =\) ?
  • \(c =\) ?

Case 2: Two points

y x
0 0
1 1
  • \(m =\) ?
  • \(c =\) ?

Case 3: Two points

y x
0 0
1 -1
  • \(m =\) ?
  • \(c =\) ?

Case 4: Three points

y x
0 0
1 1
2 2
  • \(m =\) ?
  • \(c =\) ?

Case 5: Three points

y x
0 0
1 1
2 1
  • \(m =\) ?
  • \(c =\) ?

  • Which is the best line?

Let's draw a new table:

\(m = 1; c = 0\)

y x err
0 0 0
1 1 0
2 1 1

\(m = 1.5; c = 0\)

y x err
0 0 0.0
1 1 0.5
2 1 0.5
  • So now which is the best line?
  • Why?
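
To make the comparison concrete, here is a small sketch (assuming, as in the tables, that err is the absolute difference between prediction and expected value) that recomputes the err columns for both candidate lines:

data = [(0, 0), (1, 1), (1, 2)]          # (x, y) pairs from Case 5

for m, c in [(1.0, 0.0), (1.5, 0.0)]:
    errs = [abs(m * x + c - y) for x, y in data]
    print(m, c, errs, sum(errs))
# m=1.0: errs [0.0, 0.0, 1.0], total 1.0
# m=1.5: errs [0.0, 0.5, 0.5], total 1.0 (the totals tie under absolute error)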

The Problem

Which is the best line?

  • The one that minimizes the sum of errors over all samples/examples/datapoints/rows

\(\mbox{Err}(\mbox{data} \, ; \, \mbox{model}) = \sum_i \mbox{err}(\mbox{row}_i \, ; \, \mbox{model})\)

  • What is \(\mbox{err}\)?

\(\mbox{err}(\mbox{row}_i \, ; \, \mbox{model}) = \mbox{pred}_i - \mbox{expected}_i\)

i.e. for each \(\mbox{row}_i\) we also need \(\mbox{expected}_i\), its corresponding label/supervision/annotation/ground truth

  • Now, in the context of linear regression,
  • What is \(\mbox{row}\)?
  • What is \(\mbox{expected}\)?
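
For linear regression, a row is an \((x, y)\) pair, expected is \(y\), and pred is \(m \times x + c\). Below is a minimal sketch of the two formulas above, with the per-row error taken literally as the signed difference (the function names err and Err are just illustrative):

def err(row, m, c):
    # Per-row error: pred - expected (the signed difference defined above)
    x, expected = row
    pred = m * x + c
    return pred - expected

def Err(data, m, c):
    # Total error: the sum of per-row errors over the whole dataset
    return sum(err(row, m, c) for row in data)

data = [(0, 0), (1, 1), (1, 2)]   # (x, y) pairs from Case 5
print(Err(data, m=1.0, c=0.0))    # -1.0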

  • Silly question:

  • Why do we sum over data and not models?

The Usual Scenario

  • What if the error were simply the signed difference?

\(m = 1.5; c = 0\)

y x err
0 0 0.0
1 1 +0.5
2 1 -0.5
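
A quick check of that table (a sketch; the signed error is prediction minus expected value, and the data are the Case 5 points):

data = [(0, 0), (1, 1), (1, 2)]    # (x, y) pairs from Case 5
m, c = 1.5, 0.0

signed = [m * x + c - y for x, y in data]
print(signed, sum(signed))         # [0.0, 0.5, -0.5] 0.0 -> the signed errors cancel

squared = [(m * x + c - y) ** 2 for x, y in data]
print(sum(squared) / len(data))    # about 0.167 -> squaring removes the cancellation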

Gives us a total error of 0, so...

  • We usually use error = mean squared error
  • Why not just the absolute value? The reasons are not simple, nor universally applicable; each task has to justify using MSE or pick another relevant metric.
  • Is it possible to use other errors? Yes:

    • RMSE : Root mean squared error
    • Manhattan distance : Edges on grid
    • Levenshtein distance : Symbols
  • What does this affect?

  • It affects whether a solution can be found!
  • We can set things like tolerance and margin
  • We can check which samples are 'farthest' from the model (see the sketch below)
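
As a small illustration of that last point (a sketch; the squared difference is used as the per-row error, and the data are the Case 5 points), finding the sample farthest from a candidate line:

data = [(0, 0), (1, 1), (1, 2)]   # (x, y) pairs from Case 5
m, c = 1.0, 0.0

# Per-sample errors under the chosen metric (squared difference here)
errors = [(m * x + c - y) ** 2 for x, y in data]

# The sample 'farthest' from the model is the one with the largest error
farthest = max(range(len(data)), key=lambda i: errors[i])
print(data[farthest], errors[farthest])   # (1, 2) 1.0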

The Actual Algorithm

Paraphrased from the scipy source (scipy.stats.linregress):

import numpy as np

def linregress(x, y):
    x = np.asarray(x)
    y = np.asarray(y)

    if x.size == 0 or y.size == 0:
        raise ValueError("Inputs must not be empty.")

    if np.amax(x) == np.amin(x) and len(x) > 1:
        raise ValueError("Cannot calculate a linear regression "
                         "if all x values are identical")

    xmean = np.mean(x, None)
    ymean = np.mean(y, None)

    # Average sums of square differences from the mean
    #   ssxm  = mean( (x-mean(x))^2 )
    #   ssxym = mean( (x-mean(x)) * (y-mean(y)) )
    ssxm, ssxym, _, _ = np.cov(x, y, bias=1).flat

    slope = ssxym / ssxm
    intercept = ymean - slope*xmean

    return slope, intercept
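
A quick sanity check of this routine on the toy data from the tables above (x = 0, 1, 1 and y = 0, 1, 2); it should recover the \(m = 1.5, c = 0\) line compared earlier:

x = np.array([0.0, 1.0, 1.0])
y = np.array([0.0, 1.0, 2.0])

slope, intercept = linregress(x, y)
print(slope, intercept)   # approximately 1.5 and 0.0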

One can write an entire book about covariance matrices, so this is a nuanced topic.
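To take some of the mystery out of that np.cov call in this specific two-variable case, here is a sketch of what the two entries used above contain (bias=1 means dividing by N rather than N-1):

import numpy as np

x = np.array([0.0, 1.0, 1.0])
y = np.array([0.0, 1.0, 2.0])

# Manual versions of the two entries read off np.cov(x, y, bias=1)
ssxm  = np.mean((x - x.mean()) ** 2)              # variance of x
ssxym = np.mean((x - x.mean()) * (y - y.mean()))  # covariance of x and y

cov = np.cov(x, y, bias=1)
print(np.allclose(cov[0, 0], ssxm), np.allclose(cov[0, 1], ssxym))  # True True

# The slope is covariance(x, y) / variance(x)
print(ssxym / ssxm)   # about 1.5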

How does the Algorithm work?

For a complete understanding, prerequisites for this section are

  • intuitions about signal and noise,
  • intuitions about a generating process.

Ideas for attack

  1. How can data leak: as an entry point to independence
  2. How to measure correlation: as an entry point to mean and standard deviation
  3. Signal and noise: eigenvectors
  4. Actual mathematical derivation of covariance as the solution: we need to discover our threads here