Gradient Descent is an optimization algorithm commonly used in machine learning to optimize a Cost Function or Error Function by updating the parameters of our models. These parameters refer to coefficients in Linear Regression and weights in Neural Network.
Let’s understand how Gradient Descent works by taking a practical example and applying the gradient descent optimization on it to reduce the cost function and improve the accuracy.
Machine Learning Model
Suppose we have few data points on a 2D plane referring to how the price of a house changed over few years and we want to predict some kind of relationship between the year and price such that given any random year, our Model can predict what will the price of a house in that year.
Since these data points distribution is completely random hence, we don’t have any direct formula where if we provide the value of X(year), it will result in y(price in that year), so we will use machine learning (which will create patterns) to predict the output for a new set of inputs.
Let’s draw an arbitrary line that passes through some of these data points. But, what type of Line?
We will draw a straight line that will connect as many data points as possible. This is done using simple linear regression. The equation of this line will be Y = mX + c as shown in the diagram below.
Our model above will now give an output corresponding to an input value. But during the training phase, there are chances that the Predicted output will not be the same as the actual output and this is known as an error.
Error = Y'(Predicted) - Y(Actual)
For our model to work properly, we have to minimize this error, and to do this we will use Cost Function.
Cost Function/ Error function tells us how good is our model at making predictions for the given dataset. In other words, it tells us how far away is our predicted line from the actual point. If you see the above graph, the predicted line using Linear Regression doesn’t touch every data point but passes near to them, and how far is this line(distance) from the actual point is an error.
Let’s say we have N data points, then the cost function will be Sum of squared error mean. We take squares of errors because it makes the error function positive and smooth (differentiable).
Gradient Descent is the most popular iterative optimization algorithm for finding Local Minima of a differentiable function.
Suppose we have a function f and its derivative is f’ and we are currently at point t then –
- if f'(t)<0 and we take steps proportional to negative gradient then it will lead us to local minima
- if f'(t)>0 and we take steps proportional to positive gradient then it will lead us to local maxima
If we closely see, our cost function is of the type Y = X², (where Y = Error & X = (Y’ – Y)) and this is an equation of a parabola. The graph for such equation will look like –
If the slope is positive, then we are moving away from local minima and if the slope is -ve then we are moving towards the minima.
Types of Gradient Descent
There are 3 types of gradient Descent –
- Batch Gradient Descent – It processes all training examples for each iteration. We take the average of the gradients of all the training examples and then use that mean gradient to update our parameters after every iteration or epoch(in machine learning term).
- Stochastic Gradient Descent – It processes 1 training example per iteration or epoch. SGD can be used for larger datasets. It converges faster when the dataset is large as it causes updates to the parameters more frequently, but number of iteration becomes too much.
- Mini Batch Gradient Descent – Suppose we have n training examples, the Mini Batch Gradient Descent(MBGD) works by taking a batch b of examples such that b < n and calculate the gradients of b examples in one iteration and next b in next iteration until all examples are not processed. (Note is b==n, the MBGD as good as batch gradient descent.)
Termination Condition for Gradient Descent
There are many stopping rules for Gradient Descent, let’s talk about one the stopping function optim. It has 3 stopping rules –
- maxit – Run for a predetermined maximum number of iterations.
- reltol – It stops when improvement drops below a threshold.
- abstol – stop when the function gets close enough to zero.
The learning rate or step size tells us how quickly or how slow the is model at learning patterns. If we choose the large learning rate, then more areas will be covered in a single step and this can lead to overshooting the minima and a small learning rate will take a very long time to reach the minima . Choosing the correct learning rate is known as hyperparameter tuning and most of the time is spend in this because this is a hit and trial method.
Optimizing Cost function using Gradient Descent
Remember the Cost function derived above. We will calculate the derivative of the cost function w.r.t m & c. If we process each error one by one, we can get rid of the summation. Now, the derivation of cost function w.r.t m and c will look something like this: Now, let’s calculate the gradient of error term with respect to m (weight) and c(bias). The result of derivate of error w.r.t m will be X and w.r.t c will be 1. This new gradient tells us the slope of our cost function at our current position (current parameter values) and the direction we should move to update our parameters. The size of our update is controlled by the learning rate. Let’s put these values back in the cost function and multiply it with the learning rate. The resulting function will look something like this. Also, 2 can be eliminated as it just says the learning rate is twice big or half of the actual learning rate, and while parameter we can choose the learning rate in such a way that constant effect is mitigated. That right there in the blue box is the actual gradient descent equation. These equations are for a single step, hence we will iterate updating the values of m & c until we don’t reach closer to the local minima.
But how will updating Weights(m) and Bias(c) make my result more accurate and reduce errors?
As I told you at the beginning of the post, given a dataset, we have drawn an arbitrary straight line, but we don’t know whether that line is the most accurate one. Perhaps, if the same line drawn differently might be possible we get more accurate results and fewer errors. Hence we constantly update the slope(m) and the y-intercept(c) of the arbitrary line to find out which straight line will give us fewer errors and better results. Note, by updating the m and c of a line, we are updating the slope and y-intercept.