20 Minute Machine Learning Crash Course

You wouldn’t walk out of a store without checking out, would you?

That’s illegal! Right?

At the brand new Climate Pledge Arena in Seattle, Amazon debuted their Just Walk Out cashierless technology to enable fans to get out of their store and back in their seats as quickly as possible.

Credit: https://www.seattlepi.com/local/seattlenews/article/inside-climate-pledge-arena-before-kraken-debut-16549897.php

This impressive system by Amazon is a recent application of artificial intelligence, one of the most promising emerging technologies that will have (and already is having) a major impact on our world.

If you are inspired by this or other applications of AI to use the technology in your own way, you have to start somewhere. This guide will help you get started with the technical side of AI, from defining exactly what a neural network is to finding ways to optimize the training of one.

Images in this guide are from this free course on Udacity. Feel free to check it out!


  1. What Are Neural Networks?
  2. Perceptrons
  3. Error Functions
  4. Logistic Regression
  5. Neural Network Architecture
  6. Feedforward and Backpropagation
  7. Training Optimization

What Are Neural Networks?

Neural networks are tools that can be used to solve classification problems. Given an image of a dish, classify it as a pancake or a waffle. Given a handwritten number, classify it as a digit from 0–9.

In the above graph, we are trying to predict whether or not a student gets accepted into a particular university based on their grades and test scores. In most cases, a point above/to the right of the line would be accepted while a point below/to the left of the line would not be accepted.

With neural networks, we can try to create the appropriate line that will accurately predict whether or not a student gets accepted into the university.

The equation of this particular line is:

2x₁ + x₂ - 18 = 0

More generally, the equation of a line is:

w₁x₁ + w₂x₂ + b = 0

Taking W as the vector <w₁, w₂> and x as the vector <x₁, x₂>, the line can be written as:

Wx + b = 0

We can label each of these points with a 0 if the student is rejected (depicted as red) or a 1 if the student is accepted (depicted as blue). We’ll store the label in the variable y. So, each point can be represented by (x₁, x₂, y).

Our neural network will be predicting whether or not the student gets accepted. Its prediction will be represented by the variable ŷ. It equals 1 if:

Wx + b ≥ 0 (the point is on or to the right of the line)

It equals 0 if:

Wx + b < 0 (the point is to the left of the line)
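As a minimal sketch of this rule, the snippet below hard-codes the example line 2x₁ + x₂ - 18 = 0 from above. The function name `predict` and the two sample points are made up for illustration.

```python
import numpy as np

# Weights and bias of the example line 2*x1 + x2 - 18 = 0.
W = np.array([2.0, 1.0])
b = -18.0

def predict(x):
    """Return 1 if W.x + b >= 0 (accepted), else 0 (rejected)."""
    return 1 if np.dot(W, x) + b >= 0 else 0

# Two hypothetical (x1, x2) points, chosen only for illustration.
print(predict(np.array([7.0, 6.0])))  # 2*7 + 6 - 18 = 2  -> 1
print(predict(np.array([4.0, 5.0])))  # 2*4 + 5 - 18 = -5 -> 0
```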

Our goal is to make ŷ resemble y as closely as possible. In other words, we want to find the line that keeps most of the blue points above it and most of the red points below it. How do we do this?


Perceptrons

A perceptron is the basic unit of a neural network. On the left-hand side of the perceptron, we have the input data. The edges of the perceptron carry weights that are multiplied by the inputs. The last edge carries a value called the bias.

The bigger node (circle) in the middle of the above photo calculates the sum of the products (inputs * weights) plus the bias. If the sum is greater than or equal to zero, the perceptron outputs a 1. If the sum is less than zero, the perceptron outputs a 0.

In reality, this perceptron passes the value of the summation (shown in the Linear Function node) into a step function. The step function returns 1 if the sum is greater than or equal to 0 and returns 0 if the sum is less than 0. The step function is illustrated in the second large node in the photo above.

Perceptron Algorithm

Now, we want to use perceptrons in order to split data (e.g. determining which students get accepted and which get rejected). Again, we want to be able to accurately find the line that splits the data correctly.

There is an algorithm, called the Perceptron Algorithm, that is able to do exactly this! But before I introduce this algorithm, there is one thing you should know.

Say we have a misclassified point, as shown below.

If we want the line to be pulled closer to the point, we can simply subtract the coordinates of the point (the 1 shown in the subtraction represents the bias value of 1) to get a new line!

This will drastically change the equation of the line, which means that correctly classified points may now become misclassified.

Introducing a learning rate makes each adjustment smaller. We multiply the coordinates of the point (and the bias value of 1) by the learning rate and then subtract to get a new line.

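Now, we can discuss the Perceptron Algorithm. For every point in the data set, the algorithm goes as follows:

  • If the point is classified correctly, do nothing.
  • If the point is labeled 1 but predicted 0, add the learning rate times the point’s coordinates to the weights, and add the learning rate to the bias.
  • If the point is labeled 0 but predicted 1, subtract the learning rate times the point’s coordinates from the weights, and subtract the learning rate from the bias.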

The algorithm is written below in Python (where there are two weights).

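import numpy as np

# Assumed helper (not shown in the original): the step function applied
# to the linear score, returning 1 if W.x + b >= 0 and 0 otherwise.
def prediction(x, W, b):
    return 1 if np.dot(x, W) + b >= 0 else 0

def perceptronStep(X, y, W, b, learn_rate = 0.01):
    for i in range(len(X)):
        y_hat = prediction(X[i],W,b)
        if y[i]-y_hat == 1:
            # Labeled 1 but predicted 0: add the scaled point to the line.
            W[0] += X[i][0]*learn_rate
            W[1] += X[i][1]*learn_rate
            b += learn_rate
        elif y[i]-y_hat == -1:
            # Labeled 0 but predicted 1: subtract the scaled point.
            W[0] -= X[i][0]*learn_rate
            W[1] -= X[i][1]*learn_rate
            b -= learn_rate
    return W, b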
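To see the step in action, here is a rough sketch that repeats it until every point is classified correctly. The toy data, the snake_case name `perceptron_step`, and the epoch cap of 1000 are assumptions for illustration; the update is the same as above, just written with NumPy vectors.

```python
import numpy as np

def prediction(x, W, b):
    # Step function applied to the linear score W.x + b.
    return 1 if np.dot(x, W) + b >= 0 else 0

def perceptron_step(X, y, W, b, learn_rate=0.01):
    # One pass over the data, nudging the line for each misclassified point.
    for i in range(len(X)):
        diff = y[i] - prediction(X[i], W, b)  # +1, 0, or -1
        W = W + diff * learn_rate * X[i]
        b = b + diff * learn_rate
    return W, b

# Hypothetical, linearly separable toy data (not the admissions data set).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [6.0, 7.0], [7.0, 6.0], [8.0, 8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

W, b = np.zeros(2), 0.0
for epoch in range(1000):
    W, b = perceptron_step(X, y, W, b)
    if all(prediction(x, W, b) == label for x, label in zip(X, y)):
        break  # every point is on the correct side of the line

print(W, b, epoch)
```

Because this toy data can be split by a line, the loop stops once it finds one; on data that is not linearly separable it would simply run until the epoch cap.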

Unfortunately, in some cases, the Perceptron Algorithm is just too simple.

As you can see, in this example, the data is separated by a curve, not a line. What do we do about this?

Error Functions

From now on, we can solve our problems with the help of error functions. Error functions simply tell us how far away we are from the solution.

Discrete to Continuous

But before I talk about error functions, we need to shift our data from discrete to continuous. By this, I mean that instead of representing data as either yes or no, we represent it as 63% likely, for example.

Because of this shift from discrete to continuous data, we can change our activation function from the step function to the sigmoid function. Whereas the step function returned a hard 0 or 1 depending on the sign of the linear function’s value, the sigmoid function returns a value between 0 and 1 and behaves as follows:

Graph of the sigmoid function.

The sigmoid function is defined as follows:

σ(x) = 1 / (1 + e⁻ˣ)

The following image shows the difference between our old perceptron (using the step function) and our new one (using the sigmoid function).
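The difference can also be sketched in a few lines of code. The scores below are made-up values of Wx + b, and the function names `step` and `sigmoid` are just labels for this example.

```python
import numpy as np

def step(score):
    # Discrete output: a hard yes (1) or no (0).
    return 1 if score >= 0 else 0

def sigmoid(score):
    # Continuous output: a probability strictly between 0 and 1.
    return 1 / (1 + np.exp(-score))

# Hypothetical linear scores Wx + b, chosen only for illustration.
for score in [-4.0, -0.5, 0.0, 0.5, 4.0]:
    print(f"score={score:+.1f}  step={step(score)}  sigmoid={sigmoid(score):.3f}")
```

Scores far below zero give probabilities near 0, scores far above zero give probabilities near 1, and a score of exactly 0 gives 0.5.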


Softmax

So far, we have been trying to classify data into two groups (accepted or rejected). However, we may also want to classify data into more groups! We can do this with the softmax function.

In order to define the softmax function, say we have x classes. For example, we could want to classify data as different flavors of ice cream: chocolate, vanilla, cookie dough, you name it.

Our linear model will give the following scores: Z₁, Z₂, …, Zₓ. According to the softmax function, the probability that the object is in class i is:

P(class i) = e^(Zᵢ) / (e^(Z₁) + e^(Z₂) + … + e^(Zₓ))

In other words, divide e raised to the score of class i by the sum of e raised to each of the scores (including the score of class i itself). In code, it looks as follows:

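import numpy as np

def softmax(L):
    expL = np.exp(L)
    sumExpL = sum(expL)
    result = []
    for i in expL:
        # Each class's share of the total is its probability.
        result.append(i * 1.0 / sumExpL)
    return result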
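Here is a short usage sketch with made-up scores for three flavors. It uses a vectorized variant of the function; subtracting the largest score first is a common numerical-stability trick that is an addition here, not part of the formula above.

```python
import numpy as np

def softmax(scores):
    # Subtracting the max keeps np.exp from overflowing on large scores;
    # it does not change the resulting probabilities.
    exp_scores = np.exp(np.asarray(scores, dtype=float) - np.max(scores))
    return exp_scores / exp_scores.sum()

# Hypothetical scores for three ice cream flavors.
flavors = ["chocolate", "vanilla", "cookie dough"]
probs = softmax([2.0, 1.0, 0.1])
for flavor, p in zip(flavors, probs):
    print(f"{flavor}: {p:.3f}")
print(probs.sum())
```

The probabilities add up to 1, and the flavor with the highest score gets the highest probability.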


Cross-Entropy

That’s great! Now, we can discuss cross-entropy, which is a way to measure how good or bad a model is. A good model has a low cross-entropy while a bad model has a high cross-entropy.

To define cross-entropy, let’s imagine we have three doors. Each door may or may not have a gift behind it. Putting the probabilities in a table yields:

The following table shows each case and its probability. To find the probability that a given case occurs, we simply multiply the probabilities of each individual event happening (since the events are independent of each other). For example, the probability of each door containing a gift is the probability of the green door having a gift times the probability of the red door having a gift times the probability of the blue door having a gift.

The cross-entropy of a given case is simply the negative of the natural logarithm of the probability.

Thanks to properties of logarithms, we can rewrite the cross-entropy in terms of each individual event. For example, say we take case 2 of the table (gift, gift, no gift). Its cross-entropy can be written as -ln(0.8) - ln(0.7) - ln(0.9).
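The sketch below checks this property for all eight cases. The gift probabilities (0.8, 0.7, 0.1) are not stated directly in the text; they are inferred from the -ln(0.8) - ln(0.7) - ln(0.9) example, so treat them as an assumption.

```python
import itertools
import numpy as np

# Probability of a gift behind each door (green, red, blue).
# Inferred from the example above, not copied from the original table.
p_gift = np.array([0.8, 0.7, 0.1])

for outcome in itertools.product([1, 0], repeat=3):  # 1 = gift, 0 = no gift
    y = np.array(outcome)
    # Probability of each individual event, then of the whole case.
    event_probs = np.where(y == 1, p_gift, 1 - p_gift)
    case_prob = event_probs.prod()
    # -ln(product) equals the sum of the -ln of each factor.
    cross_entropy = -np.log(event_probs).sum()
    assert np.isclose(cross_entropy, -np.log(case_prob))
    print(outcome, f"P={case_prob:.3f}", f"CE={cross_entropy:.2f}")
```

Likely cases end up with a small cross-entropy and unlikely cases with a large one.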

We can introduce the variable yᵢ as the actual value of the door. It will equal 1 if there is a present behind door i and it will equal 0 if there is no present behind door i.

In the above case, y₁ = 1, y₂ = 1, and y₃ = 0. Finally, we can reveal the formula for cross-entropy. The cross-entropy is defined as follows:

Cross-entropy = −Σᵢ [yᵢ ln(pᵢ) + (1 − yᵢ) ln(1 − pᵢ)]

This formula may look intimidating, but all it really is is the sum of the negatives of the logarithms! Do you see it?

If there is a present, yᵢ = 1. This means that the term inside the sigma is just ln(pᵢ). If there is not a present, yᵢ = 0. This means that the term inside the sigma is just ln(1- pᵢ).

My code for the cross-entropy formula is as follows:

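import numpy as np

def cross_entropy(Y, P):
    total = 0  # avoids shadowing Python's built-in sum()
    for i in range(len(Y)):
        # Adds -ln(p) when Y[i] is 1 and -ln(1 - p) when Y[i] is 0.
        total = total - (Y[i]*np.log(P[i]) + (1 - Y[i])*np.log(1-P[i]))
    return total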
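To illustrate that a good model has a low cross-entropy, here is a small comparison. It uses a vectorized version of the same formula, and both sets of predicted probabilities are hypothetical.

```python
import numpy as np

def cross_entropy(Y, P):
    Y, P = np.asarray(Y, dtype=float), np.asarray(P, dtype=float)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))

# Actual outcomes for three doors: gift, gift, no gift.
Y = [1, 1, 0]
# Two hypothetical sets of predicted gift probabilities.
confident_and_right = [0.8, 0.7, 0.1]
confident_and_wrong = [0.2, 0.3, 0.9]

print(cross_entropy(Y, confident_and_right))  # lower value: better model
print(cross_entropy(Y, confident_and_wrong))  # higher value: worse model
```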

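By convention, the error function of a model takes the average, not just the sum. For m points, it looks like this (replacing p with ŷ):

E = −(1/m) Σᵢ [yᵢ ln(ŷᵢ) + (1 − yᵢ) ln(1 − ŷᵢ)]

Since ŷ is the output of the sigmoid function, we can write the error function as (where x⁽ⁱ⁾ is the i-th point):

E(W, b) = −(1/m) Σᵢ [yᵢ ln(σ(Wx⁽ⁱ⁾ + b)) + (1 − yᵢ) ln(1 − σ(Wx⁽ⁱ⁾ + b))]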

Now, we can talk about logistic regression and gradient descent.

Logistic Regression

Logistic Regression is one of the most useful algorithms in machine learning. It goes as follows:

  • Take your data
  • Pick a random model
  • Calculate the error
  • Minimize the error, and obtain a better model

We just learned how to calculate the error. There’s one step left. How do we minimize the error? We can use gradient descent.

Gradient Descent

This is gonna get a bit math-heavy, so I hope you’re ready! Let’s do this!

We want to take the negative of the gradient of the error function (think about this like taking the negative of the change of the error function), because that is the direction in which the error decreases the fastest. We keep stepping in that direction until the error is as small as we can make it.

Here is how we calculate the gradient.

The derivative of the sigmoid function is: σ′(x)=σ(x)(1−σ(x))

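That is a very nice derivative. If you don’t believe it, we can get to it using the quotient rule:

σ′(x) = d/dx [1 / (1 + e⁻ˣ)] = e⁻ˣ / (1 + e⁻ˣ)² = [1 / (1 + e⁻ˣ)] · [e⁻ˣ / (1 + e⁻ˣ)] = σ(x)(1 − σ(x))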
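If you would rather check the derivative numerically, here is a quick sketch that compares σ(x)(1 − σ(x)) against a finite-difference estimate. The step size `h` and the sample points are arbitrary choices.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_prime(x):
    # The closed form from the text: sigma'(x) = sigma(x) * (1 - sigma(x)).
    return sigmoid(x) * (1 - sigmoid(x))

# Central finite difference as an independent numerical estimate.
x = np.linspace(-5, 5, 11)
h = 1e-5
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

print(np.max(np.abs(numeric - sigmoid_prime(x))))
```

The printed difference should be tiny, which means the two ways of computing the derivative agree.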

Recall that the error formula is:

E = −(1/m) Σᵢ [yᵢ ln(ŷᵢ) + (1 − yᵢ) ln(1 − ŷᵢ)]

To calculate the gradient of E, we take the partial derivatives of the error with respect to each of the weights and the bias. All this means is that we take the derivative of the error function with respect to one weight (or the bias) at a time.

We’ll calculate the derivative of the error each point produces and then take the average to get the gradient of the total error. The error produced by each point is:

E = −y ln(ŷ) − (1 − y) ln(1 − ŷ)

Because of the chain rule, we’ll need to take the partial derivative of ŷ. Recall that ŷ = σ(Wx + b). Therefore:

∂ŷ/∂wⱼ = σ′(Wx + b) · xⱼ = ŷ(1 − ŷ) xⱼ

Now, the derivative of the error at a point is:

∂E/∂wⱼ = −(y/ŷ) · ∂ŷ/∂wⱼ + ((1 − y)/(1 − ŷ)) · ∂ŷ/∂wⱼ = −y(1 − ŷ) xⱼ + (1 − y) ŷ xⱼ = −(y − ŷ) xⱼ

In addition:

∂E/∂b = −(y − ŷ)

This means that for a point with the following coordinates:

(x₁, …, xₙ)

The gradient of the error function at that point is:

(−(y − ŷ)x₁, …, −(y − ŷ)xₙ, −(y − ŷ))

This means that the gradient is:

∇E = −(y − ŷ)(x₁, …, xₙ, 1)

This result indicates that the closer ŷ is to y, the smaller the gradient, and vice versa. So, a small gradient means we’ll change our coordinates by a little bit, and a large gradient means we’ll change our coordinates by a lot.
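As a sanity check, the sketch below compares the standard closed-form gradient for this error function, −(y − ŷ)(x₁, …, xₙ, 1), with a numerical estimate at a single point. The weights, bias, and point are hypothetical values chosen for the example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def point_error(W, b, x, y):
    # Cross-entropy error of a single point.
    y_hat = sigmoid(np.dot(W, x) + b)
    return -y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat)

# Hypothetical weights, bias, and one labeled point.
W, b = np.array([0.5, -0.3]), 0.1
x, y = np.array([2.0, 1.0]), 1

# Analytic gradient: -(y - y_hat) * (x1, x2, 1).
y_hat = sigmoid(np.dot(W, x) + b)
analytic = -(y - y_hat) * np.append(x, 1.0)

# Numerical gradient by nudging each of w1, w2, b in turn.
h = 1e-6
params = np.append(W, b)
numeric = np.zeros(3)
for j in range(3):
    up, down = params.copy(), params.copy()
    up[j] += h
    down[j] -= h
    numeric[j] = (point_error(up[:2], up[2], x, y)
                  - point_error(down[:2], down[2], x, y)) / (2 * h)

print(analytic, numeric)
```

The two vectors should match closely, and both shrink toward zero as ŷ gets closer to y.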

Since a gradient descent step simply consists of subtracting a multiple of the gradient of the error function at every point, the weights are updated in the following way:

wᵢ′ ← wᵢ + α(y − ŷ)xᵢ

Similarly, the bias is changed as follows:

b′ ← b + α(y − ŷ)

Since we’ve taken the average of the errors, the term we are adding should be 1/m⋅α instead of α. Since α is a constant, in order to simplify calculations, we’ll just take 1/m⋅α to be our learning rate, and just call it α.

Finally, this is how the key components of gradient descent can be implemented in code:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def output_formula(features, weights, bias):
    return sigmoid(np.dot(features, weights) + bias)

def error_formula(y, output):
    return - y*np.log(output) - (1 - y) * np.log(1-output)

def update_weights(x, y, weights, bias, learnrate):
    output = output_formula(x, weights, bias)
    # (y - output) is the negative of the gradient factor, so we add it.
    d_error = y - output
    weights += learnrate * d_error * x
    bias += learnrate * d_error
    return weights, bias
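Putting the pieces together, here is a rough sketch of the full logistic regression loop: take the data, pick a random model, calculate the error, and take gradient descent steps. The toy data, the `train` function, and its default settings are assumptions for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train(features, targets, epochs=200, learnrate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # "Pick a random model": small random weights, zero bias.
    weights = rng.normal(scale=0.1, size=features.shape[1])
    bias = 0.0
    losses = []
    for _ in range(epochs):
        for x, y in zip(features, targets):
            output = sigmoid(np.dot(x, weights) + bias)
            # Gradient descent step, same form as update_weights above.
            weights = weights + learnrate * (y - output) * x
            bias = bias + learnrate * (y - output)
        # "Calculate the error": mean cross-entropy over all points.
        out = sigmoid(features @ weights + bias)
        losses.append(np.mean(-targets * np.log(out)
                              - (1 - targets) * np.log(1 - out)))
    return weights, bias, losses

# Hypothetical toy data: two well-separated clusters.
features = np.array([[0.1, 0.2], [0.3, 0.1], [0.2, 0.4],
                     [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]])
targets = np.array([0, 0, 0, 1, 1, 1])

weights, bias, losses = train(features, targets)
print(losses[0], losses[-1])
```

The error after the last epoch should be noticeably lower than after the first one.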

We now have a bunch of building blocks. It’s time to put them together and create a Multi-Layer Perceptron, also known as a Neural Network.

Neural Network Architecture

The college admissions data set we’ve been working with looks fairly easy to split with a line. However, what if we need to split a more complicated data set? Take a look at this graph:

As you can see, the red and blue regions are split by a curve, not a line. What do we do now?

Well, we can actually combine two linear models into a non-linear model.

Do you see how if we could find a way to add the two above linear models together, they would form the model on the right? Fortunately, there is a way we can do this mathematically.

At every point on the graph, we are given the probability that the point is blue.

Since in both of these models, the highlighted point is in the blue region, the probability of it being blue is quite high.

If we wanted to combine these two probabilities (and the probabilities of any given point, of course), we could start with a simple addition.

Well, wait a second. Probabilities need to be between 0 and 1. Clearly, the sum of 1.5 is not a probability. We can transform it into a probability with the sigmoid function.

After applying the sigmoid function, we get a new probability of 0.82. This makes sense, as the probabilities in the linear models were both large, so combining them also yielded a large probability. Again, 0.82 is the probability that the highlighted point is blue in the new, non-linear model.

Now that’s great and all, but what if we wanted to weight the sum? In other words, what if we wanted one of the linear models to be more influential than the other in creating the resulting non-linear model?

We can give each of the models a weight. The probability from each linear model is multiplied by that model’s weight before the results are summed and passed into the sigmoid function.

We can also introduce a bias if we want to. If we had a bias of -6, we would subtract 6 from the sum before passing the value into the sigmoid function.
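Here is a small sketch of that calculation. The individual probabilities 0.7 and 0.8 are hypothetical (the text only gives their sum of 1.5), while the weights 7 and 5 and the bias of -6 are the ones used in this guide’s example.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def combine(p1, p2, w1=1.0, w2=1.0, bias=0.0):
    # Weighted sum of the two models' probabilities, squashed back into (0, 1).
    return sigmoid(w1 * p1 + w2 * p2 + bias)

# Hypothetical probabilities from the two linear models; they sum to 1.5.
p1, p2 = 0.7, 0.8

print(round(combine(p1, p2), 2))                       # plain sum   -> 0.82
print(round(combine(p1, p2, w1=7, w2=5, bias=-6), 2))  # 7, 5 and -6 -> 0.95
```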

The image below shows what this process looks like:

Notice how this process looks a lot like a perceptron.

We’re taking some “values”, multiplying by weights, and adding a bias in both cases. This is no coincidence. This process is at the root of building neural networks.

Let’s say we have the following models and representations as perceptrons:

We want to combine the models with the first one having a weight of 7, the second having a weight of 5, and a bias of -6, just like before. This is what that would look like:

To see the magic happen, let’s connect the perceptrons together:

We get a neural network! After cleaning it up a bit, we obtain:

That looks a bit more confusing. What exactly does this image show? The weights on the left tell us what equations the linear models actually have. The weights on the right tell us how the two models are combined.

For example, in the above picture, x₁ has weights of 7 and 5. As you can see from above, the coefficients of x₁ in the two equations are 7 and 5.

Some neural networks are much more complicated than this one. We can add more nodes to the input, hidden, or output layers or even add more layers entirely!

Here is the network we just designed:

A deep neural network is a neural network that has multiple hidden layers. Here is an example of one:

Now, let’s talk about feedforward and backpropagation.

Feedforward and Backpropagation


Feedforward

Let’s say we had the following point and perceptron (w₁ is bigger than w₂, which is why the edge with weight w₁ is shaded thicker):

The perceptron plots the point and it outputs the probability that the point is blue. Clearly, our point is blue, but since it lies in the red region, the probability that it is blue is quite small. This process is known as feedforward.

This process works the same way with more complicated neural networks. The only difference is that the input passes through more layers before the output layer produces the final probability.

Look at the following model:

To get the predicted probability with feedforward, the model will do the following mathematically:

As you can see, it starts with the vector of inputs (including the bias) and multiplies it by the matrix of weights connecting the inputs to the first hidden layer. It then applies the sigmoid function to complete the first set of perceptrons. Then it gets multiplied by the next matrix of weights and the sigmoid function is applied one more time.

This can be written as follows, where W⁽¹⁾ and W⁽²⁾ are the two matrices of weights:

ŷ = σ(W⁽²⁾ · σ(W⁽¹⁾ · x))
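A minimal NumPy sketch of this computation is shown here. The weight values and the input point are made up, and appending a constant 1 to each layer’s input is one way (assumed here) to fold the bias into the weight matrices.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def feedforward(x, W1, W2):
    # Append a constant 1 so the last column of each matrix acts as the bias.
    hidden = sigmoid(W1 @ np.append(x, 1.0))
    return sigmoid(W2 @ np.append(hidden, 1.0))

# Hypothetical weights: 2 inputs -> 2 hidden nodes -> 1 output.
W1 = np.array([[5.0, -2.0, 8.0],
               [7.0, -3.0, -1.0]])
W2 = np.array([[7.0, 5.0, -6.0]])

y_hat = feedforward(np.array([0.4, 0.6]), W1, W2)
print(y_hat)  # a single probability between 0 and 1
```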


Backpropagation

Now that we know what feedforward is, it’s time to delve into how we train a neural network: backpropagation.

The basic steps are as follows:

  • Do a feedforward operation.
  • Compare the output of the model with the desired output.
  • Calculate the error.
  • Run the feedforward operation backwards (backpropagation) to spread the error to each of the weights.
  • Use this to update the weights, and get a better model.
  • Continue this until we have a model that is good.
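For the curious, the steps above can be sketched by hand for a tiny network with one hidden layer. This is only an illustration under assumed settings (3 hidden nodes, a learning rate of 0.5, XOR-style toy data); libraries normally compute these derivatives for you.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def train_step(X, y, W1, b1, W2, b2, learnrate):
    # 1. Feedforward.
    hidden = sigmoid(X @ W1 + b1)       # shape (m, n_hidden)
    y_hat = sigmoid(hidden @ W2 + b2)   # shape (m,)
    # 2-3. Compare with the desired output and calculate the error.
    error = np.mean(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
    # 4. Backpropagation: push the output error back through each layer.
    delta_out = (y_hat - y) / len(y)
    delta_hidden = np.outer(delta_out, W2) * hidden * (1 - hidden)
    # 5. Update every weight against its share of the error.
    W2 = W2 - learnrate * hidden.T @ delta_out
    b2 = b2 - learnrate * delta_out.sum()
    W1 = W1 - learnrate * X.T @ delta_hidden
    b1 = b1 - learnrate * delta_hidden.sum(axis=0)
    return W1, b1, W2, b2, error

# Hypothetical XOR-style data, which no single line can separate.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)
W2, b2 = rng.normal(size=3), 0.0

errors = []
for _ in range(5000):  # 6. Repeat until the model is good enough.
    W1, b1, W2, b2, error = train_step(X, y, W1, b1, W2, b2, learnrate=0.5)
    errors.append(error)

print(errors[0], errors[-1])
```

The error should fall as training goes on; how low it gets depends on the random starting weights.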

I’m only going to discuss the concept of backpropagation here. When you actually implement a network, libraries typically handle the math behind the scenes, so you don’t really have to worry about it.

In the above example, the blue point wants the line to be pulled closer to it so it can enter the blue region. We did this by updating the weights.

We are minimizing the error in the same way we did for gradient descent.

Imagine we have the following scenario:

Our error is high, since the model predicts that the point is red when it actually is blue.

To classify it better, the blue point in the final model would like the curve to be pulled closer to it. But what exactly does that mean?

Well, which of the two linear models is doing better? Since the bottom one classifies it correctly while the top does not, I would say that the bottom one is. Thus, we should reduce the weight coming from the top model and increase the weight from the bottom one.

We can actually do even more. We can try to make the linear models classify the points better. In the top model, we want the line to move left. In the bottom model, we want the line to move down so it can be even more confident in its classification.

Here is an illustration of what these changes may look like:

At this point, all of the steps to train a neural network have been laid out. However, sometimes training can go wrong. We can choose the wrong architecture. The data itself may have problems. Maybe the model takes too long to run. Now, we need to figure out how we can optimize our training.

Training Optimization

When creating a model, we need two groups of data: training data and testing data. We train our model with the training data without looking at the testing data. Then we use the testing data to evaluate how the model does.
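One simple way to split a data set is sketched here. The helper `train_test_split`, the 20% test fraction, and the toy data are assumptions for this example (libraries such as scikit-learn ship their own version of this helper).

```python
import numpy as np

def train_test_split(features, targets, test_fraction=0.2, seed=0):
    # Shuffle the row indices once, then set the first chunk aside for testing.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features[train_idx], targets[train_idx],
            features[test_idx], targets[test_idx])

# Hypothetical data set of 10 points with 2 features each.
features = np.arange(20, dtype=float).reshape(10, 2)
targets = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

X_train, y_train, X_test, y_test = train_test_split(features, targets)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```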

Overfitting and Underfitting

Let’s say we have the following data set:

Can you see how the labels of “Dogs” and “Not Dogs” make sense here? Well, what if we tried to label the set like this:

Well, this doesn’t quite work. It’s too simple. The cat is now misclassified. We call this underfitting. Let’s try again with different labels:

Okay, this one seems to work. But it is very specific. What if our testing set included a purple dog?

Now this is an incorrect classification. A good classifier should put this dog on the right, but this one would put it on the left. Because of how specific this is, this is an example of overfitting.

In the above models, do you see how the left is an example of underfitting while the right is overfitting? The left network is way too simple while the right is unnecessarily complicated. We need to strike a balance between the two, which is exactly what the middle model accomplishes.

Early Stopping

Notice how in the above image, as the number of epochs (times our model passes through the entire data set) grows, the training error (solid curve) continuously decreases while the testing error (dashed curve) decreases and then increases.

There is a perfect number of epochs we should train the model for to keep both errors small.

This Model Complexity Graph shows the number of epochs we should be training the model for to not underfit or overfit.

We do gradient descent until the testing error stops decreasing and starts to increase. At that moment, we stop. This algorithm is called Early Stopping.
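The stopping rule can be sketched as a small function over the per-epoch testing errors. The `patience` parameter (how many non-improving epochs to tolerate) and the error values are assumptions for this example.

```python
def early_stopping_epoch(test_errors, patience=1):
    """Return the index of the epoch to keep, given per-epoch testing errors.

    Stops once the error has failed to improve for `patience` epochs in a row.
    """
    best_epoch, best_error, bad_epochs = 0, float("inf"), 0
    for epoch, error in enumerate(test_errors):
        if error < best_error:
            best_epoch, best_error, bad_epochs = epoch, error, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # testing error has started to rise: stop training
    return best_epoch

# Hypothetical testing errors that fall and then rise again.
test_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
print(early_stopping_epoch(test_errors))  # 3
```

In a real training loop, the testing error would be measured after every epoch instead of being read from a list.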


Regularization

In general, large weights tend to lead to overfitting. Regularization is a solution to this problem because it penalizes large weights.

Here was our old error function:

E = −(1/m) Σᵢ [yᵢ ln(ŷᵢ) + (1 − yᵢ) ln(1 − ŷᵢ)]

What we want to do is add a term to the error function that is big when the weights are big. There are two ways of doing this.

One way is to add the sum of the absolute values of the weights times a constant, λ: λ(|w₁| + … + |wₙ|). The constant λ tells us how much we want to penalize the weights. If λ is large, we penalize them a lot, and vice versa. This is called L1 Regularization.

A similar method adds the sum of the squares of the weights times λ: λ(w₁² + … + wₙ²). This is called L2 Regularization.
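Both penalty terms are short to write in code. The weight vectors and the value of λ (`lam`) below are hypothetical.

```python
import numpy as np

def l1_penalty(weights, lam):
    # lambda times the sum of the absolute values of the weights.
    return lam * np.sum(np.abs(weights))

def l2_penalty(weights, lam):
    # lambda times the sum of the squared weights.
    return lam * np.sum(np.square(weights))

# Two hypothetical weight vectors pointing the same way, one 10x larger.
small = np.array([1.0, 1.0])
large = np.array([10.0, 10.0])
lam = 0.1

print(l1_penalty(small, lam), l1_penalty(large, lam))
print(l2_penalty(small, lam), l2_penalty(large, lam))
```

Either penalty is added to the error function, so the larger weights cost more. L2 punishes large weights much more sharply than L1 because of the squaring.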

These are both popular methods. We can apply one or the other depending on which works better in a given situation.


Dropout

Imagine you go to the gym every day and only work out your right arm. By the end of training, your right arm would be very strong, but your left would be very weak.

In neural networks, some weights can be very strong and dominate the training process. Smaller weights may not really change the output much, so they don’t really get trained.

To solve this, we may shut off a strong part of the network and let the other parts train. More formally, as we go through the epochs, we can randomly turn off some of the nodes of the network, saying that the model should not pass through those nodes. In this case, the other nodes have to pick up the slack and participate more in the training. This method is called dropout.

Here is what this may look like (red nodes mean they have been shut off):
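In code, shutting off nodes amounts to multiplying a layer’s outputs by a random mask of zeros and ones. This is a sketch under assumptions: the 25% drop probability is arbitrary, and rescaling the surviving nodes ("inverted dropout") is a common convention that this guide does not cover.

```python
import numpy as np

def dropout(activations, drop_prob, rng):
    # Each node is switched off independently with probability drop_prob.
    keep_mask = rng.random(activations.shape) >= drop_prob
    # Rescaling the survivors keeps the expected activation unchanged
    # (an assumed convention, not something described in the text).
    return activations * keep_mask / (1.0 - drop_prob), keep_mask

rng = np.random.default_rng(0)
hidden = np.ones(8)  # hypothetical hidden-layer activations

dropped, keep_mask = dropout(hidden, drop_prob=0.25, rng=rng)
print(keep_mask)
print(dropped)
```

A new mask is drawn for each training pass, and dropout is switched off when the model is used for predictions.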

Random Restart

With gradient descent, we are trying to find the minimum of the error function. However, we may get stuck at a local minimum and not the absolute minimum. This is shown below:

With random restart, we can start from a few random points on the function and do gradient descent from all of them. This increases the probability that we get to the absolute minimum.
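Here is a small sketch of the idea on a made-up one-dimensional error function that has one local minimum and one absolute minimum. The function, the range of starting points, and the step settings are all assumptions.

```python
import numpy as np

def error(x):
    # Hypothetical 1-D error function with a local and an absolute minimum.
    return x**4 - 3 * x**2 + x

def error_gradient(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learnrate=0.01, steps=2000):
    for _ in range(steps):
        x = x - learnrate * error_gradient(x)
    return x

def random_restart(n_starts=10, seed=0):
    rng = np.random.default_rng(seed)
    starts = rng.uniform(-2.0, 2.0, size=n_starts)
    # Run gradient descent from every start and keep the lowest end point.
    candidates = [gradient_descent(x0) for x0 in starts]
    return min(candidates, key=error)

best = random_restart()
print(best, error(best))
print(gradient_descent(2.0), error(gradient_descent(2.0)))  # a single unlucky start
```

A single run that starts at x = 2 settles in the local minimum, while the best of several random starts reaches the lower one.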

Other Activation Functions

We have discussed the sigmoid function in depth, but there are also other activation functions that may be better in different cases. Here is the hyperbolic tangent function:

And here is the rectified linear unit:
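Both functions are one-liners with NumPy. The sample inputs below are arbitrary.

```python
import numpy as np

def tanh(x):
    # Squashes its input into the range (-1, 1).
    return np.tanh(x)

def relu(x):
    # Returns x for positive inputs and 0 otherwise.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x))
print(relu(x))
```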

Learning Rate Decay

One final point I would like to mention is the question of choosing a learning rate. It is not clear what the best learning rate is. However, a general rule of thumb is that if your model is not working, decrease the learning rate. This illustration should help to demonstrate why:

With a high learning rate, we can easily skip right over the minimum. It is much harder to make this mistake with the slower descent that a lower learning rate provides.

In fact, the best learning rates are those that decrease as the model gets closer to a solution. Keras, a Python library, has options that let you do this.
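As a hand-rolled illustration (this is not the Keras API), one simple schedule is exponential decay, where the learning rate is multiplied by a fixed factor every epoch. The starting rate and decay factor here are hypothetical.

```python
def decayed_learning_rate(initial_rate, decay_rate, epoch):
    # Exponential decay: multiply the rate by decay_rate once per epoch.
    return initial_rate * decay_rate ** epoch

# Hypothetical settings: start at 0.1 and halve the rate every epoch.
for epoch in range(4):
    print(epoch, decayed_learning_rate(0.1, 0.5, epoch))
```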

Thank you so much for reading! I hope you learned something of value from this guide and are ready to take the next step in your AI journey.

If you’re ready to start building models yourself, I highly recommend the course I linked in the introduction. For convenience, I’ll leave it right here.

If you’d like, I’d greatly appreciate if you could check out my LinkedIn, newsletter, Twitter, and YouTube. Again, thanks so much for reading!



Sterling Kalogeras


18-year-old innovator with a love for computer science, math, and government.