What is the bias/variance tradeoff?

If you're not the reading type, I made a four-part series on YouTube covering the bias/variance tradeoff. Onward.

Prepare yourself for some shocking news...

In machine learning, your models will seldom get it right. As the saying goes...

All models are wrong, but some are useful. - George E. P. Box

Error in machine learning can be broken down into three sources.

  1. Error due to bias
  2. Error due to variance
  3. Error due to the absurd, chaotic nature of the universe.

The goal for this post is to learn more about 1 and 2. The third type - although fun to contemplate - is a little outside the scope of Data Science.
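For readers who like to see this in symbols, here is one standard way to write the three error sources for squared-error loss. It is a sketch under assumptions that are not part of the original post: the data follow $y = f(x) + \varepsilon$ with noise of mean zero and variance $\sigma^2$, $\hat{f}$ is the fitted model, and the expectation is taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{1. bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{2. variance}}
  + \underbrace{\sigma^2}_{\text{3. irreducible noise}}
```

The first two terms depend on the model you choose. The third does not, which is why it is set aside here.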

The data

The dataset we'll use is the Advertising dataset from the book An Introduction to Statistical Learning.

import pandas as pd

df = pd.read_csv('https://statlearning.com/s/Advertising.csv')
df.head()

The following is a sample of our dataset:

   Unnamed: 0     TV  radio  newspaper  sales
0           1  230.1   37.8       69.2   22.1
1           2   44.5   39.3       45.1   10.4
2           3   17.2   45.9       69.3    9.3
3           4  151.5   41.3       58.5   18.5
4           5  180.8   10.8       58.4   12.9

Imagine you're a data scientist for a hot new startup. Each row in the dataset represents a consumer market (e.g., New York, Seattle, Austin).

The TV, radio and newspaper columns are your ad spend in each market. The sales column represents how much you made in sales for that particular market. The Unnamed: 0 column is just a row counter carried over from the CSV file and can be ignored.

With this data, we can create a model that predicts sales given a certain budget for TV, radio and newspaper. Because sales is a continuous number rather than a category, this is a regression problem.
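To make "regression" concrete, here is a minimal sketch that fits a straight line to the five sample rows shown above. It uses only the TV column and plain Python. The helper names `fit_line` and `predict` are made up for illustration, and this is not necessarily the model used in the rest of this post. Five rows are far too few for a real model. The point is only that the output is a number.

```python
# Five sample rows from the table above: TV ad spend and sales per market.
tv_spend = [230.1, 44.5, 17.2, 151.5, 180.8]
sales = [22.1, 10.4, 9.3, 18.5, 12.9]


def fit_line(x, y):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope is the co-variation of x and y divided by the variation of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_mean, y_mean).
    intercept = y_mean - slope * x_mean
    return slope, intercept


def predict(tv_budget, slope, intercept):
    """Predict sales (a number) for a hypothetical TV budget."""
    return intercept + slope * tv_budget


slope, intercept = fit_line(tv_spend, sales)
predicted_sales = predict(100.0, slope, intercept)  # 100.0 is an arbitrary budget

print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(f"predicted sales for a TV budget of 100: {predicted_sales:.2f}")
```

A model that uses all three ad budgets works the same way, with one coefficient per column instead of a single slope.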