For my own purposes of learning, I’ve decided to document all of my interactions with machine learning in a blog series that will start from the basis and work it’s way northward. Before getting to linear regression and what it is, there’s some other aspects of machine learning that I want to address (mostly for myself). With that said, let’s just dive in!

Machine learning is an aspect of artificial intelligence that allows computers to learn something without being explicitly programmed to do so. A classic example of a problem that fits machine learning well is trying to teach a computer how to drive a car. Without getting into the details right off the bat, it turns out that it is almost impossible to write a program by hand – with all the necessary rules and models that a computer needs, to correctly navigate a roadway. The solution to this problem is to write a ML program that can model how to drive all on its own, based on a number of different inputs.

Machine learning is typically divided into two areas: supervised and unsupervised learning. This series is going to start with supervised learning, and then move into unsupervised learning at a later date. Let’s get started.

Supervised Learning & Linear Regression

Supervised learning involves feeding the correct inputs and outputs to an algorithm so that it can predict values outside of its given dataset. This is learning part of supervised learning. We give it data, so that it can use that data to learn patterns which it can then use to make predictions. An example of this would be to look at this graph of a persons height in cm, versus their weight in kg:


As you can see in the graph, we have a dataset that matches a person’s height to their weight, which can be analyzed by a programmatic algorithm. In essence, the goal is to generate a function that can take an input (height), and return an output (weight) for any value of a persons height. This particular type of problem is known as a regression problem, because contains a continuous amount of outputs.

As a slight digression, this is in contrast to the other set of problems known as classification problems, which take inputs and then returns a classification (an output from a list of possibilities) as an output. For example, we could take inputs from weather data and attempt to predict whether it will rain or not. The classification here is rain versus no rain. Any who, the point is that it doesn’t have continuous outputs but rather, discrete outputs. Regression on the other hand contains an continuous array of possible outputs.

The outputs for weights could be 70 kg, 70.1kg, 70.11kg, 80 kg, and so forth. There is an infinite amount of values that this function could return. What we would ideally want to do is generate a function whose slope might look something like this:



As you can see, we’ve drawn a line of best fit through the data in a rough manner that gives us a basis to make simple predictions. For example, we can see here that at 200 cm in height, a person could weigh about 103 kg. The algorithm that can generate such a model is known as linear regression with gradient descent (we’ll get into gradient descent below), as we have a continuous line of outputs that follows some useful pattern in the data.

Linear regression with gradient descent is one of the most basic learning algorithms, which is why we’re going to start with it. The next step is to figure out how we can programmatically take (x, y) inputs, which we call our training set, and tune an output algorithm that allows us to make predictions outside of the training set. Such a training set may look like this:


Height (cm) Weight (kg)
150 45
155 52
160 55
165 61
170 65
175 71
180 77
185 83
190 91
195 103


With the corresponding graph looking just like this:



m = number of samples (we’ll use this later). In this case, m = 10

i is the index into our training set. Ex. x(1) = 150, y(1) = 45

Say we wanted to predict, given this data, what someone with a height of 100 cm might weigh. That’s where linear regression once again, comes to the rescue. Now then, we have our training set. What we want, is a function hθ(x) = θ(0) + θ(1)x, where θ(1) is the slope and θ(0) is the y intercept (however, our data doesn’t seem to have a y intercept). Our goal here is to find the most ideal θ(0) and θ(1) values that best match our dataset. We’re going to train our function to do so. The algorithm we’re going to use to train our functions values is known as gradient descent.

The Linear Regression With Gradient Descent Algorithm

First, we’re going to need a method to measure the accuracy of hθ(x), given our dataset. In order to do so, we use what is known as a cost function. This function tells us how well the training is going and how close we are between our actual outputs (calculated) and our expected outputs (in the dataset). For this reason, hθ(x) is known as the hypothesis function. It hypothesizes outputs, while our cost function lets it know how well it’s hypotheses are. For linear regression, we’re going to use a cost function known as the square error function.

At this point in the tutorial, we’re going to start diving into the maths of ML. I’m sorry if you’re uncomfortable with math, but it’s truly the only medium available to formalize exactly what we’re doing and why it works.

With that out of the way, this is the square error function (otherwise known as, our cost function).


As we saw above, m is the number of training examples we have. hθ(x) is our hypothesis function, and the (i) notations denote indexes into the training set. Inside the brackets, we’re subtracting our actual outputs by our expected outputs, squaring them, and then adding them all together. After we have our sum, we multiply it by 1 / 2 (10) or 1/20. That is square error value, and the value of our cost function. If we were to plot this square error function in a graph, it might look something like this:


I didn’t put this graph here to scare you, but rather to show you what gradient descent does and why it works. In essence, we are trying to find a local minimum in the graph of the square error function, which is that blue section. So, lets assume we choose a random point. Gradient descent is an algorithm for descending the graphs gradient until it reaches that local minimum.



The black dot shows our random starting point, and the arrow shows where gradient descent wants to go. Okay, so obviously this a rate of change problem, which means we’re going to need calculus.

The line of best fit that we want has the equation – which again, known as the hypothesis function:

hθ(x) = θ(0) + θ(1)x  (same as y = mx + b from grade 10 math)

What we want to do, is get the partial derivative with respect to θ(0) as well as the partial derivate with respect to θ(1).


You may have noticed the α term in front of the 1/m. α is our learning rate, or how big of a step we want to take as we move closer to the local minimum in our 3D graph above. It’s sort of tricky, because if you make it too small – the algorithm will be horrendously slow. If you make it take too big of a step, you must just overshot the local minimum and then be lost to travel the peaks and crevices you didn’t intend to (making it impossible to find the local minimum). I’ve found a good learning rate to be about 0.001, but you may have to experiment given your dataset.

The gradient descent algorithm involves calculating θ(0) and θ(1) until the slope approaches a reasonably low value. For our case, we’re just going to calculate each about 100 times and assume it’s enough.

Let’s see what happens:

from matplotlib import pyplot as plt

# maps height in cm to weight in kg
dataset = [ [150,45],
			[195,103] ]

x_values = [i[0] for i in dataset]
y_values = [i[1] for i in dataset]

# if you want to graph the points
# plt.plot(x_values, y_values, 'bo')

#initialize theta_0 and theta_1 to 0.0
theta_0 = 0.0
theta_1 = 0.0
# this is alpha
learn_rate = 0.001

# hypothesis function
h = lambda x : theta_0 + (theta_1 * x)

# core of the linear regression algorithm
def summation(h, dataset):

	total_1 = 0
	total_2 = 0
	m = len(dataset)

	# update the totals simultaneously
	for i in dataset:
		total_1 += h(i[0]) - i[1]
		total_2 += (h(i[0]) - i[1]) * i[1]
	# return the totals
	return total_1 * (1.0/m), total_2 * (1.0/m)

# update theta_0 and theta_1
for i in range(100):
	sum1,sum2 = summation(h, dataset)
	# updates happen
	theta_0 = theta_0 - learn_rate * sum2
	theta_1 = theta_1 - learn_rate * sum1

print theta_0 # 38.0987086619 (y-intercept)
print theta_1 # 0.19274089936 (slope)
print h(100)  # 57.3727985979 (predicted value)
print h(170)  # 70.8646615531 (predicted value)

If we look at the predicted values for 100 (~57) and 170 (~71) in the graph above, we see that they are quite accurate – which  means linear regression using gradient descent works!

Closing Remarks

Linear regression using gradient descent can work with any number of variables, not just two. It is an method that allows us to teach computers how to tweak their own algorithm, rather than hardcode the algorithm ourselves. Linear regression is one of the simplest of the machine learning algorithms, as we can tangibly see how it works. This is in contrast to neural networks, which operate as a black box. Again, I’m sorry for the graphs and the maths, but in this subject field, it really is a necessity when it comes to grasping what it is we’re actually doing. I hope you all learned a thing or two as I did.

Cheers and happy coding!