The post Optimization for Machine Learning Crash Course appeared first on Machine Learning Mastery.

Find function optima with Python in 7 days.

All machine learning models involve optimization. As practitioners, we optimize for the most suitable hyperparameters or the best subset of features. Decision tree algorithms optimize for the split. Neural networks optimize for the weights. Most likely, we use computational algorithms to optimize.

There are many ways to optimize numerically. SciPy has a number of functions handy for this. We can also try to implement the optimization algorithms on our own.

In this crash course, you will discover how you can get started and confidently run algorithms to optimize a function with Python in seven days.

This is a big and important post. You might want to bookmark it.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

Let’s get started.

Before we get started, let’s make sure you are in the right place.

This course is for developers who may know some applied machine learning. Perhaps you have built some models and done some projects end-to-end, or modified existing example code from popular tools to solve your own problems.

The lessons in this course do assume a few things about you, such as:

- You know your way around basic Python for programming.
- You may know some basic NumPy for array manipulation.
- You have heard about gradient descent, simulated annealing, BFGS, or other optimization algorithms and want to deepen your understanding.

You do NOT need to be:

- A math wiz!
- A machine learning expert!

This crash course will take you from a developer who knows a little machine learning to a developer who can effectively and competently apply function optimization algorithms.

Note: This crash course assumes you have a working Python 3 SciPy environment with at least NumPy installed. If you need help with your environment, you can follow the step-by-step tutorial here:

This crash course is broken down into seven lessons.

You could complete one lesson per day (recommended) or complete all of the lessons in one day (hardcore). It really depends on the time you have available and your level of enthusiasm.

Below is a list of the seven lessons that will get you started and productive with optimization in Python:

- **Lesson 01**: Why optimize?
- **Lesson 02**: Grid search
- **Lesson 03**: Optimization algorithms in SciPy
- **Lesson 04**: BFGS algorithm
- **Lesson 05**: Hill-climbing algorithm
- **Lesson 06**: Simulated annealing
- **Lesson 07**: Gradient descent

Each lesson could take you 60 seconds or up to 30 minutes. Take your time and complete the lessons at your own pace. Ask questions, and even post results in the comments below.

The lessons might expect you to go off and find out how to do things. I will give you hints, but part of the point of each lesson is to force you to learn where to look for help with the algorithms and the best-of-breed tools in Python. (**Hint**: *I have all of the answers on this blog; use the search box*.)

**Post your results in the comments**; I’ll cheer you on!

Hang in there; don’t give up.

In this lesson, you will discover why and when we want to do optimization.

Machine learning differs from other kinds of software projects in that it is less obvious how we should write the program. A toy example in programming is writing a for loop to print the numbers from 1 to 100: you know exactly that you need a counter variable and that the loop should run 100 times. A toy example in machine learning is using a neural network for regression: you have no idea exactly how many iterations you need to train the model. You might set it too low or too high, and you have no rule to tell what the right number is. Hence many people consider machine learning models a **black box**. The consequence is that, while the model has many variables we can tune (the hyperparameters, for example), we do not know the correct values until we have tested them out.

In this lesson, you will discover why machine learning practitioners should study optimization to improve their skills and capabilities. In mathematics, optimization is also called function optimization: it aims to locate the maximum or minimum value of a certain **function**. Depending on the nature of the function, different methods can be applied.

Machine learning is about developing predictive models. To judge whether one model is better than another, we have evaluation metrics that measure a model’s performance on a particular data set. In this sense, if we consider the parameters that create the model as the input, the model’s inner algorithm and the data set in question as constants, and the metric evaluated from the model as the output, then we have constructed a function.

Take the decision tree as an example. We know it is a binary tree because every intermediate node asks a yes-no question. This is constant and we cannot change it. But how deep the tree should be is a hyperparameter we can control. Which features, and how many of them, we allow the decision tree to use is another. Different values for these hyperparameters change the decision tree model, which in turn gives a different metric, such as the average accuracy from k-fold cross-validation in classification problems. Then we have defined a function that takes the hyperparameters as input and the accuracy as output.

From the perspective of the decision tree library, once you have provided the hyperparameters and the training data, it can consider them constants too, and treat the selection of features and the split thresholds at every node as input. The metric is still the output, because the decision tree library shares the same goal of making the best prediction. Therefore the library also has a function defined, but a different one from that mentioned above.

The **function** here does not mean you need to explicitly define a function in a programming language. A conceptual one suffices. What we want to do next is manipulate the input and check the output until the best output is achieved. In the case of machine learning, the best can mean:

- Highest accuracy, or precision, or recall
- Largest area under the ROC curve (AUC)
- Greatest F1 score in classification, or *R*^{2} score in regression
- Least error, or log-loss

or something else along this line. We can manipulate the input by random methods such as sampling or random perturbation. We can also assume the function has certain properties and try out a sequence of inputs to exploit them. Of course, we can also check every possible input, and once we have exhausted the possibilities, we will know the best answer.
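To make the random-sampling idea concrete, here is a minimal sketch (my addition, not from the original lesson) that draws random inputs in a bounded region and keeps the best value of a simple objective:

```python
# Random search sketch: sample (x, y) uniformly in [-5, 5] x [-5, 5]
# and remember the sample with the lowest objective value.
import numpy as np

def objective(x, y):
    return x ** 2.0 + y ** 2.0

rng = np.random.default_rng(1)      # seeded for repeatability
r_min, r_max = -5.0, 5.0
best_eval, best_xy = float("inf"), None
for _ in range(1000):
    x, y = r_min + rng.random(2) * (r_max - r_min)
    value = objective(x, y)
    if value < best_eval:
        best_eval, best_xy = value, (x, y)
print('Best: f(%.5f,%.5f) = %.5f' % (best_xy[0], best_xy[1], best_eval))
```

With enough samples, the best value approaches the true minimum at (0, 0), although random search gives no guarantee of how close it gets.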

These are the basics of why we want to do optimization, what it is about, and how we can do it. You may not notice it, but training a machine learning model is doing optimization. You may also explicitly perform optimization to select features or fine-tune hyperparameters. As you can see, optimization is useful in machine learning.

For this lesson, you must find a machine learning model and list three examples that optimization might be used or might help in training and using the model. These may be related to some of the reasons above, or they may be your own personal motivations.

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will discover how to perform grid search on an arbitrary function.

In this lesson, you will discover grid search for optimization.

Let’s start with this function:

*f* (*x*, *y*) = *x*^{2} + *y*^{2}

This is a function with two-dimensional input (*x*, *y*) and one-dimensional output. What can we do to find the minimum of this function? In other words, for what *x* and *y*, we can have the least *f* (*x*, *y*)?

Without looking at what *f* (*x*, *y*) is, we can first assume *x* and *y* are in some bounded region, say from -5 to +5. Then we can check every combination of *x* and *y* in this range. If we remember the value of *f* (*x*, *y*) and keep track of the least we ever saw, we can find the minimum after exhausting the region. In Python code, it looks like this:

```python
from numpy import arange, inf

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# generate a grid sample from the domain
sample = list()
step = 0.1
for x in arange(r_min, r_max+step, step):
    for y in arange(r_min, r_max+step, step):
        sample.append([x, y])
# evaluate the sample
best_eval = inf
best_x, best_y = None, None
for x, y in sample:
    value = objective(x, y)
    if value < best_eval:
        best_x = x
        best_y = y
        best_eval = value
# summarize best solution
print('Best: f(%.5f,%.5f) = %.5f' % (best_x, best_y, best_eval))
```

This code scans from the lower bound of the range, -5, to the upper bound, +5, in increments of 0.1. The range is the same for both *x* and *y*. This creates a large number of (*x*, *y*) samples, formed from the combinations of *x* and *y* over the range. If we drew their coordinates on graph paper, they would form a grid, hence the name grid search.

With the grid of samples, we then evaluate the objective function *f* (*x*, *y*) for every sample (*x*, *y*). We keep track of the values and remember the least we ever saw. Once we have exhausted the samples on the grid, we recall the least value found as the result of the optimization.

For this lesson, you should look up how to use the numpy.meshgrid() function and rewrite the example code with it. Then you can try to replace the objective function with *f* (*x*, *y*, *z*) = (*x* – *y* + 1)^{2} + *z*^{2}, which is a function with 3D input.
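For reference, here is one possible sketch of how the numpy.meshgrid() rewrite might look for the 2D case; extending it to the 3D function is the exercise:

```python
# Grid search using numpy.meshgrid: build the whole grid at once and
# evaluate the objective over it with array operations.
import numpy as np

def objective(x, y):
    return x ** 2.0 + y ** 2.0

step = 0.1
axis = np.arange(-5.0, 5.0 + step, step)
x, y = np.meshgrid(axis, axis)    # 2-D arrays of x and y coordinates
values = objective(x, y)          # evaluate every grid point at once
i, j = np.unravel_index(np.argmin(values), values.shape)
print('Best: f(%.5f,%.5f) = %.5f' % (x[i, j], y[i, j], values[i, j]))
```

The nested loops disappear because NumPy evaluates the objective over entire arrays, which is both shorter and faster than appending samples one by one.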

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will learn how to use scipy to optimize a function.

In this lesson, you will discover how you can make use of SciPy to optimize your function.

There are a lot of optimization algorithms in the literature. Each has its strengths and weaknesses, and each is good for a different kind of situation. Reusing the same function we introduced in the previous lesson,

*f* (*x*, *y*) = *x*^{2} + *y*^{2}

we can make use of some predefined algorithms in SciPy to find its minimum. Probably the easiest is the Nelder-Mead algorithm, which is based on a series of rules that determine how to explore the surface of the function. Without going into detail, we can simply call SciPy and apply the Nelder-Mead algorithm to find a function’s minimum:

```python
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the search
result = minimize(objective, pt, method='nelder-mead')
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
```

In the code above, we need to write our function with a single vector argument. Hence the function effectively becomes

*f* (*x*[0], *x*[1]) = (*x*[0])^{2} + (*x*[1])^{2}

The Nelder-Mead algorithm needs a starting point. We choose a random point in the range of -5 to +5 for that (rand(2) is NumPy’s way of generating a random coordinate pair with values between 0 and 1). The function minimize() returns an OptimizeResult object, which contains information about the result accessible via keys. The “message” key provides a human-readable message about the success or failure of the search, and the “nfev” key tells the number of function evaluations performed in the course of the optimization. The most important is the “x” key, which specifies the input values that attained the minimum.

The Nelder-Mead algorithm works well for **convex functions**, whose shape is smooth and basin-like. For more complex functions, the algorithm may get stuck at a **local optimum** and fail to find the real global optimum.

For this lesson, you should replace the objective function in the example code above with the following:

```python
from numpy import e, pi, cos, sqrt, exp

def objective(v):
    x, y = v
    return (-20.0 * exp(-0.2 * sqrt(0.5 * (x**2 + y**2)))
            - exp(0.5 * (cos(2 * pi * x) + cos(2 * pi * y)))
            + e + 20)
```

This defines the Ackley function. The global minimum is at v=[0, 0]. However, Nelder-Mead most likely cannot find it because this function has many local minima. Repeat your code a few times and observe the output. You should get a different output each time you run the program.
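One common workaround, sketched below under the assumption that a handful of restarts is affordable, is to repeat the Nelder-Mead search from several random starting points and keep the best result found:

```python
# Random-restart Nelder-Mead on the Ackley function: run the local search
# from several starting points and keep the best converged result.
import numpy as np
from scipy.optimize import minimize

def ackley(v):
    x, y = v
    return (-20.0 * np.exp(-0.2 * np.sqrt(0.5 * (x**2 + y**2)))
            - np.exp(0.5 * (np.cos(2 * np.pi * x) + np.cos(2 * np.pi * y)))
            + np.e + 20)

rng = np.random.default_rng(0)
best = None
for _ in range(20):
    pt = -5.0 + rng.random(2) * 10.0        # random start in [-5, 5]^2
    result = minimize(ackley, pt, method='nelder-mead')
    if best is None or result['fun'] < best['fun']:
        best = result
print('Best of 20 restarts: f(%s) = %.5f' % (best['x'], best['fun']))
```

Each restart converges to whichever local minimum is nearest its starting point, so more restarts raise the chance of landing in the global basin, without guaranteeing it.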

Post your answer in the comments below. I would love to see what you come up with.

In the next lesson, you will learn how to use the same SciPy function to apply a different optimization algorithm.

In this lesson, you will discover how you can make use of SciPy to apply BFGS algorithm to optimize your function.

As we have seen in the previous lesson, we can make use of the minimize() function from scipy.optimize to optimize a function using the Nelder-Mead algorithm. This is a simple “pattern search” algorithm that does not need to know the derivatives of a function.

The first-order derivative means differentiating the objective function once. Similarly, the second-order derivative differentiates the first-order derivative one more time. If we have the second-order derivative of the objective function, we can apply Newton’s method to find its optimum.

There is another class of optimization algorithms that can approximate the second-order derivative from the first-order derivative and use the approximation to optimize the objective function. They are called **quasi-Newton methods**. BFGS is the most famous of this class.

Revisiting the same objective function that we used in previous lessons,

*f* (*x*, *y*) = *x*^{2} + *y*^{2}

we can tell that the first-order derivative is:

∇*f* = [2*x*, 2*y*]

This is a vector of two components, because the function *f* (*x*, *y*) receives a vector value of two components (*x*, *y*) and returns a scalar value.
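Before handing a derivative to an optimizer, it can be worth checking it numerically. The sketch below (my addition, not from the lesson) compares the analytic gradient with a central finite-difference approximation:

```python
# Gradient check: compare the hand-written derivative of
# f(x, y) = x^2 + y^2 against central finite differences.
import numpy as np

def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0

def derivative(x):
    return np.asarray([x[0] * 2.0, x[1] * 2.0])

def numerical_gradient(f, x, h=1e-6):
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        # central difference along dimension i
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

point = np.array([1.5, -2.0])
print(derivative(point))                       # analytic gradient
print(numerical_gradient(objective, point))    # should agree closely
```

If the two disagree beyond finite-difference error, the analytic derivative is wrong, and a gradient-based optimizer fed with it will behave erratically.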

If we create a new function for the first-order derivative, we can call SciPy and apply the BFGS algorithm:

```python
from scipy.optimize import minimize
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# derivative of the objective function
def derivative(x):
    return [x[0] * 2, x[1] * 2]

# define range for input
r_min, r_max = -5.0, 5.0
# define the starting point as a random sample from the domain
pt = r_min + rand(2) * (r_max - r_min)
# perform the bfgs algorithm search
result = minimize(objective, pt, method='BFGS', jac=derivative)
# summarize the result
print('Status : %s' % result['message'])
print('Total Evaluations: %d' % result['nfev'])
# evaluate solution
solution = result['x']
evaluation = objective(solution)
print('Solution: f(%s) = %.5f' % (solution, evaluation))
```

The first-order derivative of the objective function is provided to the minimize() function via the “jac” argument. The argument is named after the **Jacobian matrix**, the name for the first-order derivative of a function that takes a vector and returns a vector. The BFGS algorithm makes use of the first-order derivative to approximate the inverse of the **Hessian matrix** (i.e., the second-order derivative of a vector function) and uses it to find the optima.

Besides BFGS, there is also L-BFGS-B: a version of the former that uses less memory (the “L”) and in which the domain is bounded to a region (the “B”). To use this variant, we simply replace the name of the method:

```python
...
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative)
```
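Since the “B” stands for bounded, this variant also accepts a bounds argument: one (low, high) pair per component of the input vector. A possible sketch:

```python
# L-BFGS-B with explicit bounds on each input component.
import numpy as np
from scipy.optimize import minimize

def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0

def derivative(x):
    return np.asarray([x[0] * 2.0, x[1] * 2.0])

pt = np.array([4.0, -3.0])
result = minimize(objective, pt, method='L-BFGS-B', jac=derivative,
                  bounds=[(-5.0, 5.0), (-5.0, 5.0)])
print('Solution: f(%s) = %.5f' % (result['x'], result['fun']))
```

The search is then guaranteed to stay inside the box, which is useful when the objective is undefined or meaningless outside it.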

For this lesson, you should create a function with many more parameters (i.e., the vector argument to the function has far more than two components) and observe the performance of BFGS and L-BFGS-B. Do you notice a difference in speed? How different are the results from the two methods? What happens if your function is not convex but has many local optima?

Post your answer in the comments below. I would love to see what you come up with.

In this lesson, you will discover how to implement hill-climbing algorithm and use it to optimize your function.

The idea of hill climbing is to start from a point on the objective function. Then we move the point a bit in a random direction. If the move gives a better solution, we keep the new position; otherwise, we stay with the old one. After enough iterations, we should be close to the optimum of the objective function. The process is so named because it is as if we are climbing a hill, where we keep going up (or down) in any direction whenever we can.

In Python, we can write the above hill-climbing algorithm for minimization as a function:

```python
from numpy.random import rand, randn

def in_bounds(point, bounds):
    # enumerate all dimensions of the point
    for d in range(len(bounds)):
        # check if out of bounds for this dimension
        if point[d] < bounds[d, 0] or point[d] > bounds[d, 1]:
            return False
    return True

def hillclimbing(objective, bounds, n_iterations, step_size):
    # generate an initial point
    solution = None
    while solution is None or not in_bounds(solution, bounds):
        solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # evaluate the initial point
    solution_eval = objective(solution)
    # run the hill climb
    for i in range(n_iterations):
        # take a step
        candidate = None
        while candidate is None or not in_bounds(candidate, bounds):
            candidate = solution + randn(len(bounds)) * step_size
        # evaluate candidate point
        candidate_eval = objective(candidate)
        # check if we should keep the new point
        if candidate_eval <= solution_eval:
            # store the new point
            solution, solution_eval = candidate, candidate_eval
            # report progress
            print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]
```

This function allows any objective function to be passed, as long as it takes a vector and returns a scalar. The “bounds” argument should be a NumPy array of dimension *n*×2, where *n* is the size of the vector the objective function expects. It gives the lower and upper bounds of the range in which we should look for the minimum. For example, for an objective function that expects a two-dimensional vector (like the one in the previous lesson) with components between -5 and +5, we can set up the bounds as follows:

```python
import numpy as np
bounds = np.asarray([[-5.0, 5.0], [-5.0, 5.0]])
```

The “hillclimbing” function randomly picks an initial point within the bounds, then tests the objective function iteratively. Whenever it finds the objective function yielding a lower value, the solution is remembered, and the next point to test is generated from its neighborhood.
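Putting the pieces together, a complete run might look like the following compact, self-contained sketch (a condensed variant of the function above, using a seeded NumPy generator so the run is repeatable):

```python
# Self-contained hill-climbing run on f(x, y) = x^2 + y^2.
import numpy as np

def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0

rng = np.random.default_rng(2)
bounds = np.asarray([[-5.0, 5.0], [-5.0, 5.0]])
step_size, n_iterations = 0.5, 200

# random initial point within the bounds
solution = bounds[:, 0] + rng.random(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
solution_eval = objective(solution)
for i in range(n_iterations):
    candidate = solution + rng.standard_normal(len(bounds)) * step_size
    # keep the candidate only if it stays in bounds and is at least as good
    if np.all(candidate >= bounds[:, 0]) and np.all(candidate <= bounds[:, 1]):
        candidate_eval = objective(candidate)
        if candidate_eval <= solution_eval:
            solution, solution_eval = candidate, candidate_eval
print('f(%s) = %.5f' % (solution, solution_eval))
```

On this convex function, the greedy acceptance rule is enough: each accepted step moves the point closer to the minimum at the origin.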

For this lesson, you should provide your own objective function (such as by copying the one from the previous lesson), set “n_iterations” and “step_size”, and apply the “hillclimbing” function to find the minimum. Observe how the algorithm finds a solution. Try different values of “step_size” and compare the number of iterations needed to reach the proximity of the final solution.

Post your answer in the comments below. I would love to see what you come up with.

In this lesson, you will discover how simulated annealing works and how to use it.

For non-convex functions, the algorithms you learned in previous lessons may be trapped easily at local optima and fail to find the global optimum. The reason is the greedy nature of the algorithms: whenever a better solution is found, they will not let go. Hence, if an even better solution exists but is not in the proximity, the algorithm will fail to find it.

Simulated annealing tries to improve on this behavior by striking a balance between *exploration* and *exploitation*. At the beginning, when the algorithm does not know much about the function it is optimizing, it prefers to explore other solutions rather than stay with the best solution found. At a later stage, as more solutions have been explored and the chance of finding even better ones diminishes, the algorithm prefers to remain in the neighborhood of the best solution it has found.

The following is the implementation of simulated annealing as a Python function:

```python
from numpy import exp
from numpy.random import randn, rand

def simulated_annealing(objective, bounds, n_iterations, step_size, temp):
    # generate an initial point
    best = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # evaluate the initial point
    best_eval = objective(best)
    # current working solution
    curr, curr_eval = best, best_eval
    # run the algorithm
    for i in range(n_iterations):
        # take a step
        candidate = curr + randn(len(bounds)) * step_size
        # evaluate candidate point
        candidate_eval = objective(candidate)
        # check for new best solution
        if candidate_eval < best_eval:
            # store new best point
            best, best_eval = candidate, candidate_eval
            # report progress
            print('>%d f(%s) = %.5f' % (i, best, best_eval))
        # difference between candidate and current point evaluation
        diff = candidate_eval - curr_eval
        # calculate temperature for current epoch
        t = temp / float(i + 1)
        # calculate metropolis acceptance criterion
        metropolis = exp(-diff / t)
        # check if we should keep the new point
        if diff < 0 or rand() < metropolis:
            # store the new current point
            curr, curr_eval = candidate, candidate_eval
    return [best, best_eval]
```

Similar to the hill-climbing algorithm of the previous lesson, the function starts with a random initial point, and the algorithm runs for the number of loops prescribed by “n_iterations”. In each iteration, a random point in the neighborhood of the current point is picked and the objective function is evaluated on it. The best solution ever found is remembered in the variables “best” and “best_eval”. The difference from the hill-climbing algorithm is that the current point “curr” in each iteration is not necessarily the best solution. Whether the current point moves to the neighboring point or stays depends on a probability related to the number of iterations done so far and how much improvement the neighboring point offers. Because of this stochastic nature, we have a chance to escape local minima for a better solution. Finally, regardless of where we end up, we always return the best solution ever found across the iterations of the simulated annealing algorithm.
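As a worked illustration, here is a compact, self-contained variant of the simulated annealing loop above, run on the convex function *f* (*x*, *y*) = *x*^{2} + *y*^{2} with a seeded generator so the run is repeatable:

```python
# Self-contained simulated annealing run on f(x, y) = x^2 + y^2.
import numpy as np

def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0

rng = np.random.default_rng(3)
bounds = np.asarray([[-5.0, 5.0], [-5.0, 5.0]])
n_iterations, step_size, temp = 1000, 0.1, 10.0

best = bounds[:, 0] + rng.random(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
best_eval = objective(best)
curr, curr_eval = best, best_eval
for i in range(n_iterations):
    candidate = curr + rng.standard_normal(len(bounds)) * step_size
    candidate_eval = objective(candidate)
    if candidate_eval < best_eval:
        best, best_eval = candidate, candidate_eval
    diff = candidate_eval - curr_eval
    t = temp / float(i + 1)                           # cooling schedule
    if diff < 0 or rng.random() < np.exp(-diff / t):  # Metropolis criterion
        curr, curr_eval = candidate, candidate_eval
print('f(%s) = %.5f' % (best, best_eval))
```

Early on, the high temperature makes the Metropolis criterion accept almost any move; as t shrinks, the loop behaves more and more like greedy hill climbing.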

In fact, most of the hyperparameter tuning or feature selection problems encountered in machine learning are not convex. Hence simulated annealing should be more suitable than hill climbing for these optimization problems.

For this lesson, you should repeat the exercise you did in the previous lesson with the simulated annealing code above. Try the objective function *f* (*x*, *y*) = *x*^{2} + *y*^{2}, which is convex. Does simulated annealing or hill climbing take fewer iterations? Then replace the objective function with the Ackley function introduced in Lesson 03. Is the minimum found by simulated annealing or by hill climbing smaller?

Post your answer in the comments below. I would love to see what you come up with.

In this lesson, you will discover how you can implement gradient descent algorithm.

The gradient descent algorithm is *the* algorithm used to train neural networks. Although there are many variants, all of them are based on the **gradient**, or first-order derivative, of the function. The idea lies in the physical meaning of the gradient of a function: if the function takes a vector and returns a scalar, the gradient at any point tells you the **direction** in which the function increases the fastest. Hence, if we aim to find the minimum of the function, the direction we should explore is the exact opposite of the gradient.

In mathematical equation, if we are looking for the minimum of *f* (*x*), where *x* is a vector, and the gradient of *f* (*x*) is denoted by ∇*f* (*x*) (which is also a vector), then we know

*x*_{new} = *x* – *α* × ∇*f* (*x*)

will be closer to the minimum than *x*. Now let’s try to implement this in Python. Reusing the sample objective function and its derivative from Lesson 04, here is the gradient descent algorithm and its use to find the minimum of the objective function:

```python
from numpy import asarray
from numpy.random import rand

# objective function
def objective(x):
    return x[0]**2.0 + x[1]**2.0

# derivative of the objective function
def derivative(x):
    return asarray([x[0] * 2, x[1] * 2])

# gradient descent algorithm
def gradient_descent(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # run the gradient descent
    for i in range(n_iter):
        # calculate gradient
        gradient = derivative(solution)
        # take a step
        solution = solution - step_size * gradient
        # evaluate candidate point
        solution_eval = objective(solution)
        # report progress
        print('>%d f(%s) = %.5f' % (i, solution, solution_eval))
    return [solution, solution_eval]

# define range for input
bounds = asarray([[-5.0, 5.0], [-5.0, 5.0]])
# define the total iterations
n_iter = 40
# define the step size
step_size = 0.1
# perform the gradient descent search
solution, solution_eval = gradient_descent(objective, derivative, bounds, n_iter, step_size)
print("Solution: f(%s) = %.5f" % (solution, solution_eval))
```

This algorithm depends not only on the objective function but also on its derivative, so it may not be suitable for all kinds of problems. It is also sensitive to the step size: a step size too large relative to the objective function may cause gradient descent to fail to converge. If this happens, we will see that the progress is not moving toward lower values.

There are several variations that make the gradient descent algorithm more robust, for example:

- Add **momentum** to the process, where the move follows not only the gradient but also, in part, the average of the gradients from previous iterations.
- Make the step size different for each component of the vector *x*.
- Make the step size adaptive to the progress.

For this lesson, you should run the example program above with different values of “step_size” and “n_iter” and observe the difference in the progress of the algorithm. At what “step_size” does the program fail to converge? Then try to add a new parameter *β* to the gradient_descent() function as the *momentum weight*, with which the update rule becomes

*x*_{new} = *x* – *α* × ∇*f* (*x*) – *β* × *g*

where *g* is the average of ∇*f* (*x*) over, for example, the five previous iterations. Do you see any improvement in the optimization? Is this a suitable example for using momentum?
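One possible sketch of such a momentum variant follows; the function name gradient_descent_momentum, the beta value, and the five-gradient window are illustrative choices, not the only way to do it:

```python
# Gradient descent with a simple momentum term: beta weights the
# average of the last few gradients (five here).
import numpy as np
from collections import deque

def objective(x):
    return x[0] ** 2.0 + x[1] ** 2.0

def derivative(x):
    return np.asarray([x[0] * 2.0, x[1] * 2.0])

def gradient_descent_momentum(objective, derivative, start, n_iter,
                              step_size, beta, window=5):
    solution = np.asarray(start, dtype=float)
    history = deque(maxlen=window)        # last few gradients
    for i in range(n_iter):
        gradient = derivative(solution)
        history.append(gradient)
        g = np.mean(history, axis=0)      # average of recent gradients
        solution = solution - step_size * gradient - beta * g
    return solution, objective(solution)

solution, solution_eval = gradient_descent_momentum(
    objective, derivative, start=[4.0, -3.0], n_iter=40,
    step_size=0.1, beta=0.05)
print('Solution: f(%s) = %.5f' % (solution, solution_eval))
```

On this smooth convex bowl the extra term mostly just enlarges the effective step; momentum pays off more on surfaces with narrow valleys or noisy gradients.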

Post your answer in the comments below. I would love to see what you come up with.

This was the final lesson.


You made it. Well done!

Take a moment and look back at how far you have come.

You discovered:

- The importance of optimization in applied machine learning.
- How to do grid search to optimize by exhausting all possible solutions.
- How to use SciPy to optimize your own function.
- How to implement hill-climbing algorithm for optimization.
- How to use simulated annealing algorithm for optimization.
- What is gradient descent, how to use it, and some variation of this algorithm.

**How did you do with the mini-course?**

Did you enjoy this crash course?

**Do you have any questions? Were there any sticking points?**

Let me know. Leave a comment below.


The post A Gentle Introduction to Particle Swarm Optimization appeared first on Machine Learning Mastery.

In this tutorial, you will learn the rationale of PSO and its algorithm with an example. After completing this tutorial, you will know:

- What a particle swarm is and how its particles behave under the PSO algorithm
- What kind of optimization problems can be solved by PSO
- How to solve a problem using particle swarm optimization
- What are the variations of the PSO algorithm

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

**Particle Swarm Optimization** was proposed by Kennedy and Eberhart in 1995. As mentioned in the original paper, sociobiologists believe a school of fish or a flock of birds that moves in a group “can profit from the experience of all other members”. In other words, while a bird flies around searching randomly for food, for instance, all birds in the flock can share their discoveries and help the entire flock get the best hunt.

While we can simulate the movement of a flock of birds, we can also imagine that each bird helps us find the optimal solution in a high-dimensional solution space, with the best solution found by the flock being the best solution in the space. This is a **heuristic solution** because we can never prove that the real **global optimal** solution has been found, and usually it is not. However, we often find that the solution found by PSO is quite close to the global optimum.

PSO is best used to find the maximum or minimum of a function defined on a multidimensional vector space. Assume we have a function $f(X)$ that produces a real value from a vector parameter $X$ (such as a coordinate $(x,y)$ in a plane), and $X$ can take on virtually any value in the space (for example, $f(X)$ is the altitude, defined for any point on the plane); then we can apply PSO. The PSO algorithm will return the parameter $X$ it found that produces the minimum $f(X)$.

Let’s start with the following function

$$

f(x,y) = (x-3.14)^2 + (y-2.72)^2 + \sin(3x+1.41) + \sin(4y-1.73)

$$

As we can see from the plot above, this function looks like a curved egg carton. It is not a **convex function** and therefore it is hard to find its minimum because a **local minimum** found is not necessarily the **global minimum**.

So how can we find the minimum point of this function? For sure, we can resort to exhaustive search: if we check the value of $f(x,y)$ for every point on the plane, we can find the minimum point. Or, if we think it is too expensive to search every point, we can randomly sample some points on the plane and see which one gives the lowest value of $f(x,y)$. However, we also note from the shape of $f(x,y)$ that once we have found a point with a small value of $f(x,y)$, it is easier to find an even smaller value in its proximity.

This is what particle swarm optimization does. Similar to the flock of birds looking for food, we start with a number of random points on the plane (called **particles**) and let them look for the minimum point in random directions. At each step, every particle searches around the minimum point it has ever found, as well as around the minimum point found by the entire swarm. After a certain number of iterations, we take the minimum point of the function to be the minimum point ever explored by this swarm of particles.


Assume we have $P$ particles, and we denote the position of particle $i$ at iteration $t$ as $X^i(t)$, which in the example above is a coordinate $X^i(t) = (x^i(t), y^i(t))$. Besides the position, each particle also has a velocity, denoted $V^i(t)=(v_x^i(t), v_y^i(t))$. At the next iteration, the position of each particle is updated as

$$

X^i(t+1) = X^i(t)+V^i(t+1)

$$

or, equivalently,

$$

\begin{aligned}

x^i(t+1) &= x^i(t) + v_x^i(t+1) \\

y^i(t+1) &= y^i(t) + v_y^i(t+1)

\end{aligned}

$$

and at the same time, the velocities are also updated by the rule

$$

V^i(t+1) =

w V^i(t) + c_1 r_1 (pbest^i - X^i(t)) + c_2 r_2 (gbest - X^i(t))

$$

where $r_1$ and $r_2$ are random numbers between 0 and 1, constants $w$, $c_1$, and $c_2$ are parameters to the PSO algorithm, and $pbest^i$ is the position that gives the best $f(X)$ value ever explored by particle $i$ and $gbest$ is that explored by all the particles in the swarm.

Note that $pbest^i$ and $X^i(t)$ are two position vectors, and the difference $pbest^i - X^i(t)$ is a vector subtraction. Adding this difference to the original velocity $V^i(t)$ pulls the particle back toward the position $pbest^i$. The same holds for the difference $gbest - X^i(t)$.

We call the parameter $w$ the inertia weight constant. It is between 0 and 1 and determines how much the particle should keep on with its previous velocity (i.e., the speed and direction of the search). The parameters $c_1$ and $c_2$ are called the cognitive and the social coefficients respectively. They control how much weight is given to refining the particle's own search result versus following the search result of the swarm. We can consider these parameters as controlling the trade-off between **exploration** and **exploitation**.

The positions $pbest^i$ and $gbest$ are updated in each iteration to reflect the best position ever found thus far.
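Putting the update rules above together for a single particle, with arbitrary illustrative values for the state and the coefficients:

```python
import numpy as np

# arbitrary illustrative state for one particle
X = np.array([1.0, 2.0])       # current position X^i(t)
V = np.array([0.1, -0.2])      # current velocity V^i(t)
pbest = np.array([1.5, 1.5])   # particle's own best position
gbest = np.array([3.0, 3.0])   # swarm's best position

w, c1, c2 = 0.8, 0.1, 0.1      # inertia, cognitive, social coefficients
r1, r2 = np.random.rand(2)     # fresh random numbers in [0, 1)

# velocity update: inertia term + pull toward pbest + pull toward gbest
V = w*V + c1*r1*(pbest - X) + c2*r2*(gbest - X)
# position update
X = X + V
```

Every term is a vector operation, which is why the full algorithm below can update all particles at once with a 2-by-$P$ array.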

One interesting property of this algorithm that distinguishes it from other optimization algorithms is that it does not depend on the gradient of the objective function. In gradient descent, for example, we look for the minimum of a function $f(X)$ by moving $X$ in the direction of $-\nabla f(X)$, as that is where the function decreases the fastest. For a particle at position $X$, how it moves does not depend on which direction is "downhill" but only on where $pbest$ and $gbest$ are. This makes PSO particularly suitable when differentiating $f(X)$ is difficult.

Another property of PSO is that it can be parallelized easily. As we are manipulating multiple particles to find the optimal solution, each particle can be updated in parallel, and we only need to collect the updated value of $gbest$ once per iteration. This makes a map-reduce architecture a natural fit for implementing PSO.
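As a minimal sketch of this map-reduce idea (using a thread pool here rather than a real distributed setup), the objective evaluations for all particles can be dispatched in parallel, with only the reduction to $gbest$ done serially once per iteration:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def f(x, y):
    "Objective function from the text"
    return (x-3.14)**2 + (y-2.72)**2 + np.sin(3*x+1.41) + np.sin(4*y-1.73)

# random particle positions in [0,5] x [0,5]
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (2, 20))

# map step: evaluate each particle's objective in parallel
with ThreadPoolExecutor() as pool:
    obj = np.array(list(pool.map(lambda i: f(X[0, i], X[1, i]),
                                 range(X.shape[1]))))

# reduce step: collect the swarm-wide best once per iteration
gbest = X[:, obj.argmin()]
gbest_obj = obj.min()
```

For a cheap objective like this one the vectorized NumPy call is faster, but the same structure applies when each evaluation is expensive.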

Here we show how we can implement PSO to find the optimal solution.

For the same function as we showed above, we can first define it as a Python function and show it in a contour plot:

```python
import numpy as np
import matplotlib.pyplot as plt

def f(x,y):
    "Objective function"
    return (x-3.14)**2 + (y-2.72)**2 + np.sin(3*x+1.41) + np.sin(4*y-1.73)

# Contour plot: With the global minimum shown as "X" on the plot
x, y = np.array(np.meshgrid(np.linspace(0,5,100), np.linspace(0,5,100)))
z = f(x, y)
x_min = x.ravel()[z.argmin()]
y_min = y.ravel()[z.argmin()]
plt.figure(figsize=(8,6))
plt.imshow(z, extent=[0, 5, 0, 5], origin='lower', cmap='viridis', alpha=0.5)
plt.colorbar()
plt.plot([x_min], [y_min], marker='x', markersize=5, color="white")
contours = plt.contour(x, y, z, 10, colors='black', alpha=0.4)
plt.clabel(contours, inline=True, fontsize=8, fmt="%.0f")
plt.show()
```

Here we plotted the function $f(x,y)$ in the region of $0\le x,y\le 5$. We can create 20 particles at random locations in this region, together with random velocities sampled from a normal distribution with mean 0 and standard deviation 0.1, as follows:

```python
n_particles = 20
X = np.random.rand(2, n_particles) * 5
V = np.random.randn(2, n_particles) * 0.1
```

whose positions we can show on the same contour plot:

From this, we can already find $gbest$, the best position found by all the particles. Since the particles have not explored anything yet, their current position is also their $pbest^i$:

```python
pbest = X
pbest_obj = f(X[0], X[1])
gbest = pbest[:, pbest_obj.argmin()]
gbest_obj = pbest_obj.min()
```

The vector `pbest_obj` holds the best value of the objective function found by each particle. Similarly, `gbest_obj` is the best scalar value of the objective function ever found by the swarm. We use the `min()` and `argmin()` functions here because we set this up as a minimization problem. The position of `gbest` is marked as a star below.

Let’s set $c_1=c_2=0.1$ and $w=0.8$. Then we can update the positions and velocities according to the formula we mentioned above, and then update $pbest^i$ and $gbest$ afterwards:

```python
c1 = c2 = 0.1
w = 0.8

# One iteration
r = np.random.rand(2)
V = w * V + c1*r[0]*(pbest - X) + c2*r[1]*(gbest.reshape(-1,1)-X)
X = X + V
obj = f(X[0], X[1])
pbest[:, (pbest_obj >= obj)] = X[:, (pbest_obj >= obj)]
pbest_obj = np.array([pbest_obj, obj]).min(axis=0)
gbest = pbest[:, pbest_obj.argmin()]
gbest_obj = pbest_obj.min()
```

Note that since this is a minimization problem, each particle's best objective value is the elementwise minimum of its previous best and the newly evaluated value.

The following is the position after the first iteration. We mark the best position found by each particle with a black dot to distinguish it from the current positions, which are shown in blue.

We can repeat the above code segment multiple times and see how the particles explore. This is the result after the second iteration:

and this is after the 5th iteration; note that the position of $gbest$, denoted by the star, has changed:

and after the 20th iteration, we are already very close to the optimum:

This is the animation showing how we find the optimal solution as the algorithm progresses. See if you can find some resemblance to the movement of a flock of birds:

So how close is our solution? In this particular example, the global minimum found by exhaustive search is at the coordinate $(3.182,3.131)$, and the one found by the PSO algorithm above is at $(3.185,3.130)$.

Most PSO implementations follow the scheme described above. In the example, we ran PSO for a fixed number of iterations. It is easy instead to determine the number of iterations dynamically in response to progress; for example, we can stop once the global best solution $gbest$ has not improved for a number of iterations.
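Such a dynamic stopping rule can be sketched with a "patience" counter. The per-iteration update here is a toy stand-in (a single random draw) so the sketch is self-contained; in practice it would be one full PSO iteration returning the new global best value:

```python
import numpy as np

# toy stand-in for one PSO iteration: returns the new global best value
rng = np.random.default_rng(0)
def one_iteration(gbest_obj):
    candidate = rng.uniform(-2, 2)      # pretend objective value found this step
    return min(gbest_obj, candidate)

gbest_obj = np.inf
patience, stall, it = 10, 0, 0
while stall < patience:                 # stop after 10 iterations with no improvement
    new_obj = one_iteration(gbest_obj)
    stall = 0 if new_obj < gbest_obj else stall + 1
    gbest_obj = new_obj
    it += 1
```

The loop runs as long as the global best keeps improving and halts once `patience` consecutive iterations bring no improvement.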

Research on PSO has mostly focused on how to determine the hyperparameters $w$, $c_1$, and $c_2$, or on varying their values as the algorithm progresses. For example, there are proposals that make the inertia weight decrease linearly. There are also proposals that decrease the cognitive coefficient $c_1$ while increasing the social coefficient $c_2$, to bring more exploration at the beginning and more exploitation at the end. See, for example, Shi and Eberhart (1998) and Eberhart and Shi (2000).
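A linearly decreasing inertia weight, for instance, amounts to a simple schedule. The endpoint values 0.9 and 0.4 below are commonly quoted in the PSO literature, but treat them as illustrative defaults rather than part of the canonical algorithm:

```python
def inertia(t, n_iter, w_start=0.9, w_end=0.4):
    "Linearly decrease the inertia weight from w_start to w_end over n_iter steps."
    return w_start - (w_start - w_end) * t / (n_iter - 1)

# inside the PSO loop, one would use w = inertia(t, n_iter) before the velocity update
schedule = [inertia(t, 50) for t in range(50)]
```

The same template works for ramping $c_1$ down and $c_2$ up over the run.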

It should be easy to see how to adapt the above code to a higher-dimensional objective function, or to switch from minimization to maximization. The following is the complete example of finding the minimum point of the function $f(x,y)$ proposed above, together with the code to generate the plot animation:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

def f(x,y):
    "Objective function"
    return (x-3.14)**2 + (y-2.72)**2 + np.sin(3*x+1.41) + np.sin(4*y-1.73)

# Compute and plot the function in 3D within [0,5]x[0,5]
x, y = np.array(np.meshgrid(np.linspace(0,5,100), np.linspace(0,5,100)))
z = f(x, y)

# Find the global minimum
x_min = x.ravel()[z.argmin()]
y_min = y.ravel()[z.argmin()]

# Hyper-parameter of the algorithm
c1 = c2 = 0.1
w = 0.8

# Create particles
n_particles = 20
np.random.seed(100)
X = np.random.rand(2, n_particles) * 5
V = np.random.randn(2, n_particles) * 0.1

# Initialize data
pbest = X
pbest_obj = f(X[0], X[1])
gbest = pbest[:, pbest_obj.argmin()]
gbest_obj = pbest_obj.min()

def update():
    "Function to do one iteration of particle swarm optimization"
    global V, X, pbest, pbest_obj, gbest, gbest_obj
    # Update params
    r1, r2 = np.random.rand(2)
    V = w * V + c1*r1*(pbest - X) + c2*r2*(gbest.reshape(-1,1)-X)
    X = X + V
    obj = f(X[0], X[1])
    pbest[:, (pbest_obj >= obj)] = X[:, (pbest_obj >= obj)]
    pbest_obj = np.array([pbest_obj, obj]).min(axis=0)
    gbest = pbest[:, pbest_obj.argmin()]
    gbest_obj = pbest_obj.min()

# Set up base figure: The contour map
fig, ax = plt.subplots(figsize=(8,6))
fig.set_tight_layout(True)
img = ax.imshow(z, extent=[0, 5, 0, 5], origin='lower', cmap='viridis', alpha=0.5)
fig.colorbar(img, ax=ax)
ax.plot([x_min], [y_min], marker='x', markersize=5, color="white")
contours = ax.contour(x, y, z, 10, colors='black', alpha=0.4)
ax.clabel(contours, inline=True, fontsize=8, fmt="%.0f")
pbest_plot = ax.scatter(pbest[0], pbest[1], marker='o', color='black', alpha=0.5)
p_plot = ax.scatter(X[0], X[1], marker='o', color='blue', alpha=0.5)
p_arrow = ax.quiver(X[0], X[1], V[0], V[1], color='blue', width=0.005, angles='xy', scale_units='xy', scale=1)
gbest_plot = plt.scatter([gbest[0]], [gbest[1]], marker='*', s=100, color='black', alpha=0.4)
ax.set_xlim([0,5])
ax.set_ylim([0,5])

def animate(i):
    "Steps of PSO: algorithm update and show in plot"
    title = 'Iteration {:02d}'.format(i)
    # Update params
    update()
    # Set picture
    ax.set_title(title)
    pbest_plot.set_offsets(pbest.T)
    p_plot.set_offsets(X.T)
    p_arrow.set_offsets(X.T)
    p_arrow.set_UVC(V[0], V[1])
    gbest_plot.set_offsets(gbest.reshape(1,-1))
    return ax, pbest_plot, p_plot, p_arrow, gbest_plot

anim = FuncAnimation(fig, animate, frames=list(range(1,50)), interval=500, blit=False, repeat=True)
anim.save("PSO.gif", dpi=120, writer="imagemagick")

print("PSO found best solution at f({})={}".format(gbest, gbest_obj))
print("Global optimal at f({})={}".format([x_min,y_min], f(x_min,y_min)))
```

These are the original papers that proposed the particle swarm optimization, and the early research on refining its hyperparameters:

- Kennedy J. and Eberhart R. C. Particle swarm optimization. In *Proceedings of the International Conference on Neural Networks*; Institute of Electrical and Electronics Engineers. Vol. 4. 1995. pp. 1942–1948. DOI: 10.1109/ICNN.1995.488968
- Shi Y. and Eberhart R. A modified particle swarm optimizer. In *Proceedings of the IEEE International Conference on Evolutionary Computation*, 1998. pp. 69–73. DOI: 10.1109/ICEC.1998.699146
- Eberhart R. C. and Shi Y. Comparing inertia weights and constriction factors in particle swarm optimization. In *Proceedings of the 2000 Congress on Evolutionary Computation (CEC ‘00)*. Vol. 1. 2000. pp. 84–88. DOI: 10.1109/CEC.2000.870279

In this tutorial we learned:

- How particle swarm optimization works
- How to implement the PSO algorithm
- Some possible variations in the algorithm

As particle swarm optimization has few hyperparameters and places very few requirements on the objective function, it can be used to solve a wide range of problems.

The post A Gentle Introduction to Particle Swarm Optimization appeared first on Machine Learning Mastery.

The post Differential Evolution from Scratch in Python appeared first on Machine Learning Mastery.

The differential evolution algorithm belongs to a broader family of evolutionary computing algorithms. Similar to other popular direct search approaches, such as genetic algorithms and evolution strategies, the differential evolution algorithm starts with an initial population of candidate solutions. These candidate solutions are iteratively improved by introducing mutations into the population, and retaining the fittest candidate solutions that yield a lower objective function value.

The differential evolution algorithm is advantageous over the aforementioned popular approaches because it can handle nonlinear and non-differentiable multi-dimensional objective functions, while requiring very few control parameters to steer the minimisation. These characteristics make the algorithm easier and more practical to use.

In this tutorial, you will discover the differential evolution algorithm for global optimisation.

After completing this tutorial, you will know:

- Differential evolution is a heuristic approach for the global optimisation of nonlinear and non-differentiable continuous space functions.
- How to implement the differential evolution algorithm from scratch in Python.
- How to apply the differential evolution algorithm to a real-valued 2D objective function.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

**June/2021**: Fixed mutation operation in the code to match the description.

This tutorial is divided into three parts; they are:

- Differential Evolution
- Differential Evolution Algorithm From Scratch
- Differential Evolution Algorithm on the Sphere Function

Differential evolution is a heuristic approach for the global optimisation of nonlinear and non-differentiable continuous space functions.

For a minimisation algorithm to be considered practical, it is expected to fulfil a number of requirements:

(1) Ability to handle non-differentiable, nonlinear and multimodal cost functions.

(2) Parallelizability to cope with computation intensive cost functions.

(3) Ease of use, i.e. few control variables to steer the minimization. These variables should also be robust and easy to choose.

(4) Good convergence properties, i.e. consistent convergence to the global minimum in consecutive independent trials.

— A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.

The strength of the differential evolution algorithm stems from the fact that it was designed to fulfil all of the above requirements.

Differential Evolution (DE) is arguably one of the most powerful and versatile evolutionary optimizers for the continuous parameter spaces in recent times.

— Recent advances in differential evolution: An updated survey, 2016.

The algorithm begins by randomly initialising a population of real-valued decision vectors, also known as genomes or chromosomes. These represent the candidate solutions to the multi-dimensional optimisation problem.

At each iteration, the algorithm introduces mutations into the population to generate new candidate solutions. The mutation process adds the weighted difference between two population vectors to a third vector, to produce a mutated vector. The parameters of the mutated vector are again mixed with the parameters of another predetermined vector, the target vector, during a process known as crossover that aims to increase the diversity of the perturbed parameter vectors. The resulting vector is known as the trial vector.

DE generates new parameter vectors by adding the weighted difference between two population vectors to a third vector. Let this operation be called mutation.

In order to increase the diversity of the perturbed parameter vectors, crossover is introduced.

— A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.

These mutations are generated according to a mutation strategy, which follows a general naming convention of DE/x/y/z, where DE stands for Differential Evolution, while x denotes the vector to be mutated, y denotes the number of difference vectors considered for the mutation of x, and z is the type of crossover in use. For instance, the popular strategies:

- DE/rand/1/bin
- DE/best/2/bin

specify that the vector x can either be picked randomly (rand) from the population, or else the vector with the lowest cost (best) is selected; that the number of difference vectors under consideration is either 1 or 2; and that crossover is performed according to independent binomial (bin) experiments. The DE/best/2/bin strategy, in particular, appears to be highly beneficial in improving the diversity of the population if the population size is large enough.
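The DE/best/2 mutation, for example, adds two weighted difference vectors to the current best vector. A minimal NumPy sketch (the population values and the helper name `mutation_best_2` are illustrative, not part of the canonical algorithm):

```python
import numpy as np

def mutation_best_2(best, r1, r2, r3, r4, F):
    "DE/best/2 mutation: the best vector plus two weighted difference vectors."
    return best + F * (r1 - r2) + F * (r3 - r4)

# illustrative population of six 2-D vectors; pretend pop[0] is the current best
rng = np.random.default_rng(1)
pop = rng.uniform(-5, 5, (6, 2))
mutated = mutation_best_2(pop[0], pop[1], pop[2], pop[3], pop[4], F=0.5)
```

Compare this with the DE/rand/1 mutation implemented later in this tutorial, which uses a randomly chosen base vector and a single difference vector.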

The usage of two difference vectors seems to improve the diversity of the population if the number of population vectors NP is high enough.

— A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.

A final selection operation replaces the target vector, or the parent, by the trial vector, its offspring, if the latter yields a lower objective function value. Hence, the fitter offspring now becomes a member of the newly generated population, and subsequently participates in the mutation of further population members. These iterations continue until a termination criterion is reached.

The iterations continue till a termination criterion (such as exhaustion of maximum functional evaluations) is satisfied.

— Recent advances in differential evolution: An updated survey, 2016.

The differential evolution algorithm requires very few parameters to operate, namely the population size, NP, a real and constant scale factor, F ∈ [0, 2], that weights the differential variation during the mutation process, and a crossover rate, CR ∈ [0, 1], that is determined experimentally. This makes the algorithm easy and practical to use.

In addition, the canonical DE requires very few control parameters (3 to be precise: the scale factor, the crossover rate and the population size) — a feature that makes it easy to use for the practitioners.

— Recent advances in differential evolution: An updated survey, 2016.

There have been further variants to the canonical differential evolution algorithm described above, which one may read about in Recent advances in differential evolution – An updated survey, 2016.

Now that we are familiar with the differential evolution algorithm, let’s look at how to implement it from scratch.


In this section, we will explore how to implement the differential evolution algorithm from scratch.

The differential evolution algorithm begins by generating an initial population of candidate solutions. For this purpose, we shall use the rand() function to generate an array of random values sampled from a uniform distribution over the range, [0, 1).

We will then scale these values to change the range of their distribution to (lower bound, upper bound), where the bounds are specified in the form of a 2D array with each dimension corresponding to each input variable.

```python
...
# initialise population of candidate solutions randomly within the specified bounds
pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0]))
```

It is within these same bounds that the objective function will also be evaluated. An objective function of choice and the bounds on each input variable may, therefore, be defined as follows:

```python
# define objective function (placeholder for now)
def obj(x):
    return 0

# define lower and upper bounds as a 2D array, one (lower, upper) pair per input variable
bounds = asarray([(-5.0, 5.0)])
```

We can evaluate our initial population of candidate solutions by passing it to the objective function as input argument.

```python
...
# evaluate initial population of candidate solutions
obj_all = [obj(ind) for ind in pop]
```

We shall be replacing the values in obj_all with better ones as the population evolves and converges towards an optimal solution.

We can then loop over a predefined number of iterations of the algorithm, such as 100 or 1,000, as specified by the parameter `iter`, as well as over all candidate solutions.

```python
...
# run iterations of the algorithm
for i in range(iter):
    # iterate over all candidate solutions
    for j in range(pop_size):
        ...
```

The first step of the algorithm iteration performs a mutation process. For this purpose, three random candidates, a, b and c, that are not the current one are selected from the population, and a mutated vector is generated by computing: a + F * (b - c). Recall that F ∈ [0, 2] and denotes the mutation scale factor.

```python
...
# choose three candidates, a, b and c, that are not the current one
candidates = [candidate for candidate in range(pop_size) if candidate != j]
a, b, c = pop[choice(candidates, 3, replace=False)]
```

The mutation process is performed by the function, mutation, to which we pass a, b, c and F as input arguments.

```python
# define mutation operation
def mutation(x, F):
    return x[0] + F * (x[1] - x[2])

...
# perform mutation
mutated = mutation([a, b, c], F)
...
```

Since we are operating within a bounded range of values, we need to check whether the newly mutated vector is also within the specified bounds, and if not, clip its values to the upper or lower limits as necessary. This check is carried out by the function, check_bounds.

```python
# define boundary check operation
def check_bounds(mutated, bounds):
    mutated_bound = [clip(mutated[i], bounds[i, 0], bounds[i, 1]) for i in range(len(bounds))]
    return mutated_bound
```

The next step performs crossover, where specific values of the current, target, vector are replaced by the corresponding values in the mutated vector, to create a trial vector. The decision of which values to replace is based on whether a uniform random value generated for each input variable falls below a crossover rate. If it does, then the corresponding values from the mutated vector are copied to the target vector.

The crossover process is implemented by the crossover() function, which takes the mutated and target vectors as input, as well as the crossover rate, cr ∈ [0, 1], and the number of input variables.

```python
# define crossover operation
def crossover(mutated, target, dims, cr):
    # generate a uniform random value for every dimension
    p = rand(dims)
    # generate trial vector by binomial crossover
    trial = [mutated[i] if p[i] < cr else target[i] for i in range(dims)]
    return trial

...
# perform crossover
trial = crossover(mutated, pop[j], len(bounds), cr)
...
```

A final selection step replaces the target vector by the trial vector if the latter yields a lower objective function value. For this purpose, we evaluate both vectors on the objective function and subsequently perform selection, storing the new objective function value in obj_all if the trial vector is found to be the fittest of the two.

```python
...
# compute objective function value for target vector
obj_target = obj(pop[j])
# compute objective function value for trial vector
obj_trial = obj(trial)
# perform selection
if obj_trial < obj_target:
    # replace the target vector with the trial vector
    pop[j] = trial
    # store the new objective function value
    obj_all[j] = obj_trial
```

We can tie all steps together into a differential_evolution() function that takes as input arguments the population size, the bounds of each input variable, the total number of iterations, the mutation scale factor and the crossover rate, and returns the best solution found and its evaluation.

```python
def differential_evolution(pop_size, bounds, iter, F, cr):
    # initialise population of candidate solutions randomly within the specified bounds
    pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0]))
    # evaluate initial population of candidate solutions
    obj_all = [obj(ind) for ind in pop]
    # find the best performing vector of initial population
    best_vector = pop[argmin(obj_all)]
    best_obj = min(obj_all)
    prev_obj = best_obj
    # run iterations of the algorithm
    for i in range(iter):
        # iterate over all candidate solutions
        for j in range(pop_size):
            # choose three candidates, a, b and c, that are not the current one
            candidates = [candidate for candidate in range(pop_size) if candidate != j]
            a, b, c = pop[choice(candidates, 3, replace=False)]
            # perform mutation
            mutated = mutation([a, b, c], F)
            # check that lower and upper bounds are retained after mutation
            mutated = check_bounds(mutated, bounds)
            # perform crossover
            trial = crossover(mutated, pop[j], len(bounds), cr)
            # compute objective function value for target vector
            obj_target = obj(pop[j])
            # compute objective function value for trial vector
            obj_trial = obj(trial)
            # perform selection
            if obj_trial < obj_target:
                # replace the target vector with the trial vector
                pop[j] = trial
                # store the new objective function value
                obj_all[j] = obj_trial
        # find the best performing vector at each iteration
        best_obj = min(obj_all)
        # store the lowest objective function value
        if best_obj < prev_obj:
            best_vector = pop[argmin(obj_all)]
            prev_obj = best_obj
            # report progress at each improvement
            print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj))
    return [best_vector, best_obj]
```

Now that we have implemented the differential evolution algorithm, let’s investigate how to use it to optimise an objective function.

In this section, we will apply the differential evolution algorithm to an objective function.

We will use a simple two-dimensional sphere objective function specified within the bounds, [-5, 5]. The sphere function is continuous, convex and unimodal, and is characterised by a single global minimum at f(0, 0) = 0.0.

```python
# define objective function
def obj(x):
    return x[0]**2.0 + x[1]**2.0
```

We will minimise this objective function with the differential evolution algorithm, based on the strategy DE/rand/1/bin.

In order to do so, we must define values for the algorithm parameters, specifically for the population size, the number of iterations, the mutation scale factor and the crossover rate. We set these values empirically to, 10, 100, 0.5 and 0.7 respectively.

```python
...
# define population size
pop_size = 10
# define number of iterations
iter = 100
# define scale factor for mutation
F = 0.5
# define crossover rate for recombination
cr = 0.7
```

We also define the bounds of each input variable.

```python
...
# define lower and upper bounds for every dimension
bounds = asarray([(-5.0, 5.0), (-5.0, 5.0)])
```

Next, we carry out the search and report the results.

```python
...
# perform differential evolution
solution = differential_evolution(pop_size, bounds, iter, F, cr)
```

Tying this all together, the complete example is listed below.

```python
# differential evolution search of the two-dimensional sphere objective function
from numpy.random import rand
from numpy.random import choice
from numpy import asarray
from numpy import clip
from numpy import argmin
from numpy import min
from numpy import around

# define objective function
def obj(x):
    return x[0]**2.0 + x[1]**2.0

# define mutation operation
def mutation(x, F):
    return x[0] + F * (x[1] - x[2])

# define boundary check operation
def check_bounds(mutated, bounds):
    mutated_bound = [clip(mutated[i], bounds[i, 0], bounds[i, 1]) for i in range(len(bounds))]
    return mutated_bound

# define crossover operation
def crossover(mutated, target, dims, cr):
    # generate a uniform random value for every dimension
    p = rand(dims)
    # generate trial vector by binomial crossover
    trial = [mutated[i] if p[i] < cr else target[i] for i in range(dims)]
    return trial

def differential_evolution(pop_size, bounds, iter, F, cr):
    # initialise population of candidate solutions randomly within the specified bounds
    pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0]))
    # evaluate initial population of candidate solutions
    obj_all = [obj(ind) for ind in pop]
    # find the best performing vector of initial population
    best_vector = pop[argmin(obj_all)]
    best_obj = min(obj_all)
    prev_obj = best_obj
    # run iterations of the algorithm
    for i in range(iter):
        # iterate over all candidate solutions
        for j in range(pop_size):
            # choose three candidates, a, b and c, that are not the current one
            candidates = [candidate for candidate in range(pop_size) if candidate != j]
            a, b, c = pop[choice(candidates, 3, replace=False)]
            # perform mutation
            mutated = mutation([a, b, c], F)
            # check that lower and upper bounds are retained after mutation
            mutated = check_bounds(mutated, bounds)
            # perform crossover
            trial = crossover(mutated, pop[j], len(bounds), cr)
            # compute objective function value for target vector
            obj_target = obj(pop[j])
            # compute objective function value for trial vector
            obj_trial = obj(trial)
            # perform selection
            if obj_trial < obj_target:
                # replace the target vector with the trial vector
                pop[j] = trial
                # store the new objective function value
                obj_all[j] = obj_trial
        # find the best performing vector at each iteration
        best_obj = min(obj_all)
        # store the lowest objective function value
        if best_obj < prev_obj:
            best_vector = pop[argmin(obj_all)]
            prev_obj = best_obj
            # report progress at each improvement
            print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj))
    return [best_vector, best_obj]

# define population size
pop_size = 10
# define lower and upper bounds for every dimension
bounds = asarray([(-5.0, 5.0), (-5.0, 5.0)])
# define number of iterations
iter = 100
# define scale factor for mutation
F = 0.5
# define crossover rate for recombination
cr = 0.7
# perform differential evolution
solution = differential_evolution(pop_size, bounds, iter, F, cr)
print('\nSolution: f([%s]) = %.5f' % (around(solution[0], decimals=5), solution[1]))
```

Running the example reports the progress of the search including the iteration number, and the response from the objective function each time an improvement is detected.

At the end of the search, the best solution is found and its evaluation is reported.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that the algorithm converges very close to f(0.0, 0.0) = 0.0 in about 33 improvements out of 100 iterations.

```
Iteration: 1 f([[ 0.89709 -0.45082]]) = 1.00800
Iteration: 2 f([[-0.5382 0.29676]]) = 0.37773
Iteration: 3 f([[ 0.41884 -0.21613]]) = 0.22214
Iteration: 4 f([[0.34737 0.29676]]) = 0.20873
Iteration: 5 f([[ 0.20692 -0.1747 ]]) = 0.07334
Iteration: 7 f([[-0.23154 -0.00557]]) = 0.05364
Iteration: 8 f([[ 0.11956 -0.02632]]) = 0.01499
Iteration: 11 f([[ 0.01535 -0.02632]]) = 0.00093
Iteration: 15 f([[0.01918 0.01603]]) = 0.00062
Iteration: 18 f([[0.01706 0.00775]]) = 0.00035
Iteration: 20 f([[0.00467 0.01275]]) = 0.00018
Iteration: 21 f([[ 0.00288 -0.00175]]) = 0.00001
Iteration: 27 f([[ 0.00286 -0.00175]]) = 0.00001
Iteration: 30 f([[-0.00059 0.00044]]) = 0.00000
Iteration: 37 f([[-1.5e-04 8.0e-05]]) = 0.00000
Iteration: 41 f([[-1.e-04 -8.e-05]]) = 0.00000
Iteration: 43 f([[-4.e-05 6.e-05]]) = 0.00000
Iteration: 48 f([[-2.e-05 6.e-05]]) = 0.00000
Iteration: 49 f([[-6.e-05 0.e+00]]) = 0.00000
Iteration: 50 f([[-4.e-05 1.e-05]]) = 0.00000
Iteration: 51 f([[1.e-05 1.e-05]]) = 0.00000
Iteration: 55 f([[1.e-05 0.e+00]]) = 0.00000
Iteration: 64 f([[-0. -0.]]) = 0.00000
Iteration: 68 f([[ 0. -0.]]) = 0.00000
Iteration: 72 f([[-0. 0.]]) = 0.00000
Iteration: 77 f([[-0. 0.]]) = 0.00000
Iteration: 79 f([[0. 0.]]) = 0.00000
Iteration: 84 f([[ 0. -0.]]) = 0.00000
Iteration: 86 f([[-0. -0.]]) = 0.00000
Iteration: 87 f([[-0. -0.]]) = 0.00000
Iteration: 95 f([[-0. 0.]]) = 0.00000
Iteration: 98 f([[-0. 0.]]) = 0.00000

Solution: f([[-0. 0.]]) = 0.00000
```

We can plot the objective function values returned at every improvement by modifying the differential_evolution() function slightly to keep track of the objective function values and return this in the list, obj_iter.

```python
def differential_evolution(pop_size, bounds, iter, F, cr):
    # initialise population of candidate solutions randomly within the specified bounds
    pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0]))
    # evaluate initial population of candidate solutions
    obj_all = [obj(ind) for ind in pop]
    # find the best performing vector of initial population
    best_vector = pop[argmin(obj_all)]
    best_obj = min(obj_all)
    prev_obj = best_obj
    # initialise list to store the objective function value at each improvement
    obj_iter = list()
    # run iterations of the algorithm
    for i in range(iter):
        # iterate over all candidate solutions
        for j in range(pop_size):
            # choose three candidates, a, b and c, that are not the current one
            candidates = [candidate for candidate in range(pop_size) if candidate != j]
            a, b, c = pop[choice(candidates, 3, replace=False)]
            # perform mutation
            mutated = mutation([a, b, c], F)
            # check that lower and upper bounds are retained after mutation
            mutated = check_bounds(mutated, bounds)
            # perform crossover
            trial = crossover(mutated, pop[j], len(bounds), cr)
            # compute objective function value for target vector
            obj_target = obj(pop[j])
            # compute objective function value for trial vector
            obj_trial = obj(trial)
            # perform selection
            if obj_trial < obj_target:
                # replace the target vector with the trial vector
                pop[j] = trial
                # store the new objective function value
                obj_all[j] = obj_trial
        # find the best performing vector at each iteration
        best_obj = min(obj_all)
        # store the lowest objective function value
        if best_obj < prev_obj:
            best_vector = pop[argmin(obj_all)]
            prev_obj = best_obj
            obj_iter.append(best_obj)
            # report progress at each improvement
            print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj))
    return [best_vector, best_obj, obj_iter]
```

We can then create a line plot of these objective function values to see the relative changes at every improvement during the search.

```python
from matplotlib import pyplot
...
# perform differential evolution
solution = differential_evolution(pop_size, bounds, iter, F, cr)
...
# line plot of best objective function values
pyplot.plot(solution[2], '.-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Evaluation f(x)')
pyplot.show()
```

Tying this together, the complete example is listed below.

```python
# differential evolution search of the two-dimensional sphere objective function
from numpy.random import rand
from numpy.random import choice
from numpy import asarray
from numpy import clip
from numpy import argmin
from numpy import min
from numpy import around
from matplotlib import pyplot

# define objective function
def obj(x):
    return x[0]**2.0 + x[1]**2.0

# define mutation operation
def mutation(x, F):
    return x[0] + F * (x[1] - x[2])

# define boundary check operation
def check_bounds(mutated, bounds):
    mutated_bound = [clip(mutated[i], bounds[i, 0], bounds[i, 1]) for i in range(len(bounds))]
    return mutated_bound

# define crossover operation
def crossover(mutated, target, dims, cr):
    # generate a uniform random value for every dimension
    p = rand(dims)
    # generate trial vector by binomial crossover
    trial = [mutated[i] if p[i] < cr else target[i] for i in range(dims)]
    return trial

def differential_evolution(pop_size, bounds, iter, F, cr):
    # initialise population of candidate solutions randomly within the specified bounds
    pop = bounds[:, 0] + (rand(pop_size, len(bounds)) * (bounds[:, 1] - bounds[:, 0]))
    # evaluate initial population of candidate solutions
    obj_all = [obj(ind) for ind in pop]
    # find the best performing vector of initial population
    best_vector = pop[argmin(obj_all)]
    best_obj = min(obj_all)
    prev_obj = best_obj
    # initialise list to store the objective function value at each improvement
    obj_iter = list()
    # run iterations of the algorithm
    for i in range(iter):
        # iterate over all candidate solutions
        for j in range(pop_size):
            # choose three candidates, a, b and c, that are not the current one
            candidates = [candidate for candidate in range(pop_size) if candidate != j]
            a, b, c = pop[choice(candidates, 3, replace=False)]
            # perform mutation
            mutated = mutation([a, b, c], F)
            # check that lower and upper bounds are retained after mutation
            mutated = check_bounds(mutated, bounds)
            # perform crossover
            trial = crossover(mutated, pop[j], len(bounds), cr)
            # compute objective function value for target vector
            obj_target = obj(pop[j])
            # compute objective function value for trial vector
            obj_trial = obj(trial)
            # perform selection
            if obj_trial < obj_target:
                # replace the target vector with the trial vector
                pop[j] = trial
                # store the new objective function value
                obj_all[j] = obj_trial
        # find the best performing vector at each iteration
        best_obj = min(obj_all)
        # store the lowest objective function value
        if best_obj < prev_obj:
            best_vector = pop[argmin(obj_all)]
            prev_obj = best_obj
            obj_iter.append(best_obj)
            # report progress at each improvement
            print('Iteration: %d f([%s]) = %.5f' % (i, around(best_vector, decimals=5), best_obj))
    return [best_vector, best_obj, obj_iter]

# define population size
pop_size = 10
# define lower and upper bounds for every dimension
bounds = asarray([(-5.0, 5.0), (-5.0, 5.0)])
# define number of iterations
iter = 100
# define scale factor for mutation
F = 0.5
# define crossover rate for recombination
cr = 0.7
# perform differential evolution
solution = differential_evolution(pop_size, bounds, iter, F, cr)
print('\nSolution: f([%s]) = %.5f' % (around(solution[0], decimals=5), solution[1]))
# line plot of best objective function values
pyplot.plot(solution[2], '.-')
pyplot.xlabel('Improvement Number')
pyplot.ylabel('Evaluation f(x)')
pyplot.show()
```

Running the example creates a line plot.

The line plot shows the objective function evaluation for each improvement, with large changes initially and very small changes towards the end of the search as the algorithm converged on the optima.
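For comparison, SciPy ships a ready-made implementation of the same algorithm. The sketch below (the `sphere` helper name is our own) applies scipy.optimize.differential_evolution to the same sphere objective with the same bounds:

```python
# differential evolution via SciPy's built-in implementation, for comparison
from scipy.optimize import differential_evolution

# same two-dimensional sphere objective as the from-scratch version
def sphere(x):
    return x[0] ** 2.0 + x[1] ** 2.0

# same search bounds for each dimension
bounds = [(-5.0, 5.0), (-5.0, 5.0)]
# run the search; seed fixes the random number generator for repeatability
result = differential_evolution(sphere, bounds, seed=1)
print('Status: %s' % result.message)
print('Solution: f(%s) = %.5f' % (result.x, result.fun))
```

The built-in version also exposes strategy, mutation, and recombination parameters that correspond to the choices made in the from-scratch implementation.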

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

This section provides more resources on the topic if you are looking to go deeper.

- A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces, 1997.
- Recent advances in differential evolution: An updated survey, 2016.

- Algorithms for Optimization, 2019.

In this tutorial, you discovered the differential evolution algorithm.

Specifically, you learned:

- Differential evolution is a heuristic approach for the global optimisation of nonlinear and non-differentiable continuous space functions.
- How to implement the differential evolution algorithm from scratch in Python.
- How to apply the differential evolution algorithm to a real-valued 2D objective function.

The post Differential Evolution from Scratch in Python appeared first on Machine Learning Mastery.

The post Modeling Pipeline Optimization With scikit-learn appeared first on Machine Learning Mastery.

A machine learning pipeline can be created by putting together a sequence of steps involved in training a machine learning model. It can be used to automate a machine learning workflow. The pipeline can involve pre-processing, feature selection, classification/regression, and post-processing. More complex applications may need to fit in other necessary steps within this pipeline.

By optimization, we mean tuning the model for the best performance. The success of any learning model rests on the selection of the best parameters that give the best possible results. Optimization can be looked at in terms of a search algorithm, which walks through a space of parameters and hunts down the best out of them.

After completing this tutorial, you should:

- Appreciate the significance of a pipeline and its optimization.
- Be able to set up a machine learning pipeline.
- Be able to optimize the pipeline.
- Know techniques to analyze the results of optimization.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

This tutorial will show you how to:

- Set up a pipeline using the Pipeline object from sklearn.pipeline.
- Perform a grid search for the best parameters using GridSearchCV() from sklearn.model_selection.
- Analyze the results from GridSearchCV() and visualize them.

Before we demonstrate all the above, let’s write the import section:

```python
from pandas import read_csv # For dataframes
from pandas import DataFrame # For dataframes
from numpy import ravel # For matrices
import matplotlib.pyplot as plt # For plotting data
import seaborn as sns # For plotting data
from sklearn.model_selection import train_test_split # For train/test splits
from sklearn.neighbors import KNeighborsClassifier # The k-nearest neighbor classifier
from sklearn.feature_selection import VarianceThreshold # Feature selector
from sklearn.pipeline import Pipeline # For setting up pipeline
# Various pre-processing steps
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV # For optimization
```

We’ll use the Ecoli Dataset from the UCI Machine Learning Repository to demonstrate all the concepts of this tutorial. This dataset is maintained by Kenta Nakai. Let’s first load the Ecoli dataset in a Pandas DataFrame and view the first few rows.

```python
# Read the Ecoli dataset from the UCI ML Repository and store in dataframe df
df = read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data',
    sep='\s+',
    header=None)
print(df.head())
```

Running the example you should see the following:

```
            0     1     2     3    4     5     6     7   8
0   AAT_ECOLI  0.49  0.29  0.48  0.5  0.56  0.24  0.35  cp
1  ACEA_ECOLI  0.07  0.40  0.48  0.5  0.54  0.35  0.44  cp
2  ACEK_ECOLI  0.56  0.40  0.48  0.5  0.49  0.37  0.46  cp
3  ACKA_ECOLI  0.59  0.49  0.48  0.5  0.52  0.45  0.36  cp
4   ADI_ECOLI  0.23  0.32  0.48  0.5  0.55  0.25  0.35  cp
```

We’ll ignore the first column, which specifies the sequence name. The last column is the class label. Let’s separate the features from the class label and split the dataset into 2/3 training instances and 1/3 test examples.

```python
...
# The data matrix X
X = df.iloc[:,1:-1]
# The labels
y = (df.iloc[:,-1:])

# Encode the labels into unique integers
encoder = LabelEncoder()
y = encoder.fit_transform(ravel(y))

# Split the data into test and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
print(X_train.shape)
print(X_test.shape)
```

Running the example you should see the following:

```
(224, 7)
(112, 7)
```

Great! Now we have 224 samples in the training set and 112 samples in the test set. We have chosen a small dataset so that we can focus on the concepts, rather than the data itself.

For this tutorial, we have chosen the k-nearest neighbor classifier to perform the classification of this dataset.


First, let’s just check how the k-nearest neighbor performs on the training and test sets. This would give us a baseline for performance.

```python
...
knn = KNeighborsClassifier().fit(X_train, y_train)
print('Training set score: ' + str(knn.score(X_train, y_train)))
print('Test set score: ' + str(knn.score(X_test, y_test)))
```

Running the example you should see the following:

```
Training set score: 0.9017857142857143
Test set score: 0.8482142857142857
```

We should keep in mind that the true judge of a classifier’s performance is the test set score and not the training set score. The test set score reflects the generalization ability of a classifier.

For this tutorial, we’ll set up a very basic pipeline that consists of the following sequence:

- **Scaler**: For pre-processing data, i.e., transforming the data to zero mean and unit variance using the StandardScaler().
- **Feature selector**: Uses VarianceThreshold() for discarding features whose variance is less than a certain defined threshold.
- **Classifier**: KNeighborsClassifier(), which implements the k-nearest neighbor classifier and selects the class of the majority of the k points closest to the test example.

```python
...
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', KNeighborsClassifier())
])
```

The pipe object is simple to understand. It says: scale first, select features second, and classify at the end. Let’s call the fit() method of the pipe object on our training data and get the training and test scores.

```python
...
pipe.fit(X_train, y_train)
print('Training set score: ' + str(pipe.score(X_train, y_train)))
print('Test set score: ' + str(pipe.score(X_test, y_test)))
```

Running the example you should see the following:

```
Training set score: 0.8794642857142857
Test set score: 0.8392857142857143
```

So it looks like the performance of this pipeline is worse than that of the single classifier on raw data. Not only did we add extra processing, but it was all in vain. Don’t despair: the real benefit of the pipeline comes from tuning it. The next section explains how to do that.

In the code below, we’ll show the following:

- We can search for the best scalers. Instead of just the StandardScaler(), we can try MinMaxScaler(), Normalizer() and MaxAbsScaler().
- We can search for the best variance threshold to use in the selector, i.e., VarianceThreshold().
- We can search for the best value of k for the KNeighborsClassifier().

The parameters variable below is a dictionary that specifies the key:value pairs. Note that each key must be written with a double underscore __ separating the step name that we selected in the Pipeline() from its parameter name. Note the following:

- The scaler has no double underscore, as we have specified a list of objects there.
- We would search for the best threshold for the selector, i.e., VarianceThreshold(). Hence we have specified a list of values [0, 0.001, 0.01] to choose from.
- Different values are specified for the n_neighbors, p and leaf_size parameters of the KNeighborsClassifier().

```python
...
parameters = {'scaler': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
              'selector__threshold': [0, 0.001, 0.01],
              'classifier__n_neighbors': [1, 3, 5, 7, 10],
              'classifier__p': [1, 2],
              'classifier__leaf_size': [1, 5, 10, 15]
              }
```

The pipe, along with the above parameter dictionary, is then passed to a GridSearchCV() object that searches the parameter space for the best set of parameters, as shown below:

```python
...
grid = GridSearchCV(pipe, parameters, cv=2).fit(X_train, y_train)
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))
```

Running the example you should see the following:

```
Training set score: 0.8928571428571429
Test set score: 0.8571428571428571
```

By tuning the pipeline, we achieved quite an improvement over a simple classifier and a non-optimized pipeline. It is important to analyze the results of the optimization process.

Don’t worry too much about the warning that you get by running the code above. It is generated because we have very few training samples and the cross-validation object does not have enough samples for one of the classes in one of its folds.
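One way to make such warnings less likely (a side note, not part of the original tutorial) is to keep the class proportions similar in the training and test sets by passing stratify=y to train_test_split(). A small sketch with toy labels standing in for the encoded Ecoli classes:

```python
# stratified train/test split keeps class proportions similar in both sets
from collections import Counter
from sklearn.model_selection import train_test_split

# toy labels: 30 samples of class 0, 12 of class 1, 6 of class 2
y = [0] * 30 + [1] * 12 + [2] * 6
X = [[v] for v in range(len(y))]

# stratify=y allocates each class proportionally to both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0, stratify=y)
print(Counter(y_train))
print(Counter(y_test))
```

With exact proportions as here, each class appears in the test set at the same 1/3 rate as in the full data, so even the rarest class is represented.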

Let’s look at the tuned grid object and gain an understanding of the GridSearchCV() object.

The object is so named because it sets up a multi-dimensional grid, with each corner representing a combination of parameters to try. This defines a parameter space. As an example, if we have three values of n_neighbors, i.e., {1, 3, 5}, two values of leaf_size, i.e., {1, 5}, and two values of threshold, i.e., {0, 0.0001}, then we have a 3D grid with 3x2x2 = 12 corners. Each corner represents a different combination.

For each corner of the above grid, the GridSearchCV() object computes the mean cross-validation score on the unseen examples and selects the corner/combination of parameters that give the best result. The code below shows how to access the best parameters of the grid and the best pipeline for our task.
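The size of this grid can be checked directly with sklearn's ParameterGrid utility. A small sketch using the example values above (the `grid_spec` name is our own):

```python
# count the parameter combinations a grid search would evaluate
from sklearn.model_selection import ParameterGrid

grid_spec = {'classifier__n_neighbors': [1, 3, 5],
             'classifier__leaf_size': [1, 5],
             'selector__threshold': [0, 0.0001]}
combinations = list(ParameterGrid(grid_spec))
print(len(combinations))  # 3 x 2 x 2 = 12 corners
print(combinations[0])
```

Multiplying out the value counts this way is a quick sanity check that a grid search will not take unexpectedly long.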

```python
...
# Access the best set of parameters
best_params = grid.best_params_
print(best_params)
# Stores the optimum model in best_pipe
best_pipe = grid.best_estimator_
print(best_pipe)
```

Running the example you should see the following:

```
{'classifier__leaf_size': 1, 'classifier__n_neighbors': 7, 'classifier__p': 2, 'scaler': StandardScaler(), 'selector__threshold': 0}
Pipeline(steps=[('scaler', StandardScaler()),
                ('selector', VarianceThreshold(threshold=0)),
                ('classifier', KNeighborsClassifier(leaf_size=1, n_neighbors=7))])
```

Another useful technique for analyzing the results is to construct a DataFrame from the grid.cv_results_. Let’s view the columns of this data frame.

```python
...
result_df = DataFrame.from_dict(grid.cv_results_, orient='columns')
print(result_df.columns)
```

Running the example you should see the following:

```
Index(['mean_fit_time', 'std_fit_time', 'mean_score_time', 'std_score_time',
       'param_classifier__leaf_size', 'param_classifier__n_neighbors',
       'param_classifier__p', 'param_scaler', 'param_selector__threshold',
       'params', 'split0_test_score', 'split1_test_score', 'mean_test_score',
       'std_test_score', 'rank_test_score'],
      dtype='object')
```

This DataFrame is very valuable as it shows us the scores for different parameters. The mean_test_score column is the average of the scores on the validation folds during cross-validation. The DataFrame may be too big to inspect manually, hence it is always a good idea to plot the results. Let’s see how n_neighbors affects the performance for different scalers and for different values of p.

```python
...
sns.relplot(data=result_df,
            kind='line',
            x='param_classifier__n_neighbors',
            y='mean_test_score',
            hue='param_scaler',
            col='param_classifier__p')
plt.show()
```

Running the example you should see the following:

The plots clearly show that using StandardScaler(), with n_neighbors=7 and p=2, gives the best result. Let’s make one more set of plots with leaf_size.

```python
...
sns.relplot(data=result_df,
            kind='line',
            x='param_classifier__n_neighbors',
            y='mean_test_score',
            hue='param_scaler',
            col='param_classifier__leaf_size')
plt.show()
```

Running the example you should see the following:

Tying this all together, the complete code example is listed below.

```python
from pandas import read_csv # For dataframes
from pandas import DataFrame # For dataframes
from numpy import ravel # For matrices
import matplotlib.pyplot as plt # For plotting data
import seaborn as sns # For plotting data
from sklearn.model_selection import train_test_split # For train/test splits
from sklearn.neighbors import KNeighborsClassifier # The k-nearest neighbor classifier
from sklearn.feature_selection import VarianceThreshold # Feature selector
from sklearn.pipeline import Pipeline # For setting up pipeline
# Various pre-processing steps
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV # For optimization

# Read the Ecoli dataset from the UCI ML Repository and store in dataframe df
df = read_csv(
    'https://archive.ics.uci.edu/ml/machine-learning-databases/ecoli/ecoli.data',
    sep='\s+',
    header=None)
print(df.head())

# The data matrix X
X = df.iloc[:,1:-1]
# The labels
y = (df.iloc[:,-1:])

# Encode the labels into unique integers
encoder = LabelEncoder()
y = encoder.fit_transform(ravel(y))

# Split the data into test and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)
print(X_train.shape)
print(X_test.shape)

knn = KNeighborsClassifier().fit(X_train, y_train)
print('Training set score: ' + str(knn.score(X_train, y_train)))
print('Test set score: ' + str(knn.score(X_test, y_test)))

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', KNeighborsClassifier())
])
pipe.fit(X_train, y_train)
print('Training set score: ' + str(pipe.score(X_train, y_train)))
print('Test set score: ' + str(pipe.score(X_test, y_test)))

parameters = {'scaler': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
              'selector__threshold': [0, 0.001, 0.01],
              'classifier__n_neighbors': [1, 3, 5, 7, 10],
              'classifier__p': [1, 2],
              'classifier__leaf_size': [1, 5, 10, 15]
              }
grid = GridSearchCV(pipe, parameters, cv=2).fit(X_train, y_train)
print('Training set score: ' + str(grid.score(X_train, y_train)))
print('Test set score: ' + str(grid.score(X_test, y_test)))

# Access the best set of parameters
best_params = grid.best_params_
print(best_params)
# Stores the optimum model in best_pipe
best_pipe = grid.best_estimator_
print(best_pipe)

result_df = DataFrame.from_dict(grid.cv_results_, orient='columns')
print(result_df.columns)

sns.relplot(data=result_df,
            kind='line',
            x='param_classifier__n_neighbors',
            y='mean_test_score',
            hue='param_scaler',
            col='param_classifier__p')
plt.show()

sns.relplot(data=result_df,
            kind='line',
            x='param_classifier__n_neighbors',
            y='mean_test_score',
            hue='param_scaler',
            col='param_classifier__leaf_size')
plt.show()
```

In this tutorial we learned the following:

- How to build a machine learning pipeline.
- How to optimize the pipeline using GridSearchCV.
- How to analyze and compare the results attained by using different sets of parameters.

The dataset used for this tutorial is quite small, but the results are still better than those of a simple classifier.

For the interested readers, here are a few resources:

- On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation by Gavin Cawley and Nicola Talbot
- A Gentle Introduction to k-fold Cross-Validation
- Model Selection
- k-nearest Neighbors Algorithm
- A Gentle Introduction to Machine Learning Modeling Pipelines

- sklearn.model_selection.ParameterSampler API
- sklearn.model_selection.RandomizedSearchCV API
- sklearn.model_selection.KFold API
- sklearn.model_selection.ParameterGrid API

- UCI Machine Learning Repository maintained by Dua and Graff.
- Ecoli Dataset maintained by Kenta Nakai. Please see this paper for more information.

The post Modeling Pipeline Optimization With scikit-learn appeared first on Machine Learning Mastery.

The post Gradient Descent With AdaGrad From Scratch appeared first on Machine Learning Mastery.

A limitation of gradient descent is that it uses the same step size (learning rate) for each input variable. This can be a problem on objective functions that have different amounts of curvature in different dimensions, and in turn, may require a different sized step to a new point.

**Adaptive Gradients**, or **AdaGrad** for short, is an extension of the gradient descent optimization algorithm that allows the step size in each dimension used by the optimization algorithm to be automatically adapted based on the gradients (partial derivatives) seen for that variable over the course of the search.

In this tutorial, you will discover how to develop the gradient descent with adaptive gradients optimization algorithm from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable in the objective function, called adaptive gradients or AdaGrad.
- How to implement the AdaGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

This tutorial is divided into three parts; they are:

- Gradient Descent
- Adaptive Gradient (AdaGrad)
- Gradient Descent With AdaGrad
- Two-Dimensional Test Problem
- Gradient Descent Optimization With AdaGrad
- Visualization of AdaGrad

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first order derivative, or simply the “*derivative*,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the “gradient.”

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x = x – step_size * f'(x)

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**(*alpha*): Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
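As a concrete one-dimensional illustration (a sketch of the update rule above, not code from the tutorial), a few gradient descent steps on f(x) = x^2, whose derivative is f'(x) = 2x and whose minimum is at x = 0:

```python
# minimal gradient descent on f(x) = x^2, minimum at x = 0
def derivative(x):
    return 2.0 * x

x = 2.0          # starting point
step_size = 0.1  # learning rate (alpha)
for i in range(50):
    # move against the gradient: x = x - step_size * f'(x)
    x = x - step_size * derivative(x)
print('%.6f' % x)
```

Each step multiplies x by (1 - 2 * step_size) = 0.8 here, so the search converges smoothly; a step size above 1.0 would instead cause the iterates to diverge.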

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at AdaGrad.


The Adaptive Gradient algorithm, or AdaGrad for short, is an extension to the gradient descent optimization algorithm.

The algorithm was described by John Duchi, et al. in their 2011 paper titled “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.”

It is designed to accelerate the optimization process, e.g. decrease the number of function evaluations required to reach the optima, or to improve the capability of the optimization algorithm, e.g. result in a better final result.

The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.

— Page 307, Deep Learning, 2016.

A problem with the gradient descent algorithm is that the step size (learning rate) is the same for each variable or dimension in the search space. It is possible that better performance can be achieved using a step size that is tailored to each variable, allowing larger movements in dimensions with a consistently steep gradient and smaller movements in dimensions with less steep gradients.

AdaGrad is designed to specifically explore the idea of automatically tailoring the step size for each dimension in the search space.

The adaptive subgradient method, or Adagrad, adapts a learning rate for each component of x

— Page 77, Algorithms for Optimization, 2019.

This is achieved by first calculating a step size for a given dimension, then using the calculated step size to make a movement in that dimension using the partial derivative. This process is then repeated for each dimension in the search space.

Adagrad dulls the influence of parameters with consistently high gradients, thereby increasing the influence of parameters with infrequent updates.

— Page 77, Algorithms for Optimization, 2019.

AdaGrad is suited to objective functions where the curvature of the search space is different in different dimensions, allowing a more effective optimization given the customization of the step size in each dimension.

The algorithm requires that you set an initial step size for all input variables as per normal, such as 0.1 or 0.001, or similar. However, a benefit of the algorithm is that it is not as sensitive to the initial learning rate as the gradient descent algorithm.

Adagrad is far less sensitive to the learning rate parameter alpha. The learning rate parameter is typically set to a default value of 0.01.

— Page 77, Algorithms for Optimization, 2019.

An internal variable is then maintained for each input variable that is the sum of the squared partial derivatives for the input variable observed during the search.

This sum of the squared partial derivatives is then used to calculate the step size for the variable by dividing the initial step size value (e.g. the hyperparameter value specified at the start of the run) by the square root of the sum of the squared partial derivatives.

- cust_step_size = step_size / sqrt(s)

It is possible for the square root of the sum of squared partial derivatives to result in a value of 0.0, resulting in a divide by zero error. Therefore, a tiny value can be added to the denominator to avoid this possibility, such as 1e-8.

- cust_step_size = step_size / (1e-8 + sqrt(s))

Where *cust_step_size* is the calculated step size for an input variable for a given point during the search, *step_size* is the initial step size, *sqrt()* is the square root operation, and *s* is the sum of the squared partial derivatives for the input variable seen during the search so far.

The custom step size is then used to calculate the value for the variable in the next point or solution in the search.

- x(t+1) = x(t) – cust_step_size * f'(x(t))

This process is then repeated for each input variable until a new point in the search space is created and can be evaluated.

Importantly, the partial derivative for the current solution (iteration of the search) is included in the sum of the squared partial derivatives.

We could maintain an array of partial derivatives or squared partial derivatives for each input variable, but this is not necessary. Instead, we simply maintain the sum of the squared partial derivatives and add new values to this sum along the way.
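The bookkeeping described above can be sketched for a single variable in a few lines (the variable names here are our own; the full two-dimensional implementation follows later in the tutorial):

```python
# AdaGrad update for a single variable of f(x) = x^2
from math import sqrt

x = 1.0            # starting point
step_size = 0.1    # initial step size (alpha)
sq_grad_sum = 0.0  # running sum of squared partial derivatives
for i in range(100):
    gradient = 2.0 * x              # partial derivative f'(x)
    sq_grad_sum += gradient ** 2.0  # include the current gradient in the sum
    # custom step size: initial step size over sqrt of the accumulated sum
    cust_step_size = step_size / (1e-8 + sqrt(sq_grad_sum))
    # move against the gradient using the custom step size
    x = x - cust_step_size * gradient
print('%.6f' % x)
```

Note how the effective step size shrinks as the squared gradients accumulate, which is exactly the dulling effect described in the quotes above.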

Now that we are familiar with the AdaGrad algorithm, let’s explore how we might implement it and evaluate its performance.

In this section, we will explore how to implement the gradient descent optimization algorithm with adaptive gradients.

First, let’s define an objective function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this function.

```python
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
```

We can create a three-dimensional plot of the objective function to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

```python
# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
# add_subplot replaces the gca(projection='3d') call removed in newer Matplotlib
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

```python
# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the AdaGrad optimization algorithm.

We can apply the gradient descent with adaptive gradient algorithm to the test problem.

First, we need a function that calculates the derivative for this function.

- f(x) = x^2
- f'(x) = x * 2

The derivative of x^2 is x * 2 in each dimension.

The *derivative()* function implements this below.

```python
# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
```

Next, we can implement gradient descent with adaptive gradients.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

```python
...
# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
```

Next, we need to initialize the sum of the squared partial derivatives for each dimension to 0.0 values.

```python
...
# list of the sum square gradients for each variable
sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
```

We can then enumerate a fixed number of iterations of the search optimization algorithm defined by a “*n_iter*” hyperparameter.

```python
...
# run the gradient descent
for it in range(n_iter):
    ...
```

The first step is to calculate the gradient for the current solution using the *derivative()* function.

```python
...
# calculate gradient
gradient = derivative(solution[0], solution[1])
```

We then need to calculate the square of the partial derivative of each variable and add them to the running sum of these values.

```python
...
# update the sum of the squared partial derivatives
for i in range(gradient.shape[0]):
    sq_grad_sums[i] += gradient[i]**2.0
```

We can then use the sum squared partial derivatives and gradient to calculate the next point.

We will do this one variable at a time, first calculating the step size for the variable, then the new value for the variable. These values are built up in an array until we have a completely new solution that is in the steepest descent direction from the current point using the custom step sizes.

```python
...
# build a solution one variable at a time
new_solution = list()
for i in range(solution.shape[0]):
    # calculate the step size for this variable
    alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
    # calculate the new position in this variable
    value = solution[i] - alpha * gradient[i]
    # store this variable
    new_solution.append(value)
```

This new solution can then be evaluated using the *objective()* function and the performance of the search can be reported.

```python
...
# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
```

And that’s it.

We can tie all of this together into a function named *adagrad()* that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations and the initial learning rate, and returns the final solution and its evaluation.

This complete function is listed below.

```python
# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]
```

**Note**: we have intentionally used lists and imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized implementation with NumPy arrays for better performance.
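As a sketch of what such a vectorized adaptation might look like, the per-variable loops can collapse into whole-array expressions. The function name *adagrad_vectorized* is my own and this is illustrative, not the tutorial's implementation, but with the same seed it follows the same trajectory as the list-based version:

```python
# vectorized sketch of the adagrad update using numpy arrays
from numpy import asarray
from numpy import sqrt
from numpy import zeros
from numpy.random import rand
from numpy.random import seed

# same test objective and derivative as above
def objective(x, y):
    return x**2.0 + y**2.0

def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

def adagrad_vectorized(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # running sum of squared partial derivatives, one element per dimension
    sq_grad_sums = zeros(bounds.shape[0])
    for it in range(n_iter):
        gradient = derivative(solution[0], solution[1])
        # accumulate squared gradients element-wise
        sq_grad_sums += gradient**2.0
        # per-dimension step sizes and the update as whole-array expressions
        alpha = step_size / (1e-8 + sqrt(sq_grad_sums))
        solution = solution - alpha * gradient
    return solution, objective(solution[0], solution[1])

seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = adagrad_vectorized(objective, derivative, bounds, 50, 0.1)
print('f(%s) = %f' % (best, score))
```

Because NumPy broadcasts the arithmetic across dimensions, the same code works unchanged for objective functions with more than two variables (only the `derivative()` call would need generalizing).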

We can then define our hyperparameters and call the *adagrad()* function to optimize our test objective function.

In this case, we will use 50 iterations of the algorithm and an initial learning rate of 0.1, both chosen after a little trial and error.

```python
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Tying all of this together, the complete example of gradient descent optimization with adaptive gradients is listed below.

```python
# gradient descent optimization with adagrad for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the step size for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search with adagrad
best, score = adagrad(objective, derivative, bounds, n_iter, step_size)
print('Done!')
print('f(%s) = %f' % (best, score))
```

Running the example applies the AdaGrad optimization algorithm to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

```
>0 f([-0.06595599 0.34064899]) = 0.12039
>1 f([-0.02902286 0.27948766]) = 0.07896
>2 f([-0.0129815 0.23463749]) = 0.05522
>3 f([-0.00582483 0.1993997 ]) = 0.03979
>4 f([-0.00261527 0.17071256]) = 0.02915
>5 f([-0.00117437 0.14686138]) = 0.02157
>6 f([-0.00052736 0.12676134]) = 0.01607
>7 f([-0.00023681 0.10966762]) = 0.01203
>8 f([-0.00010634 0.09503809]) = 0.00903
>9 f([-4.77542704e-05 8.24607972e-02]) = 0.00680
>10 f([-2.14444463e-05 7.16123835e-02]) = 0.00513
>11 f([-9.62980437e-06 6.22327049e-02]) = 0.00387
>12 f([-4.32434258e-06 5.41085063e-02]) = 0.00293
>13 f([-1.94188148e-06 4.70624414e-02]) = 0.00221
>14 f([-8.72017797e-07 4.09453989e-02]) = 0.00168
>15 f([-3.91586740e-07 3.56309531e-02]) = 0.00127
>16 f([-1.75845235e-07 3.10112252e-02]) = 0.00096
>17 f([-7.89647442e-08 2.69937139e-02]) = 0.00073
>18 f([-3.54597657e-08 2.34988084e-02]) = 0.00055
>19 f([-1.59234984e-08 2.04577993e-02]) = 0.00042
>20 f([-7.15057749e-09 1.78112581e-02]) = 0.00032
>21 f([-3.21102543e-09 1.55077005e-02]) = 0.00024
>22 f([-1.44193729e-09 1.35024688e-02]) = 0.00018
>23 f([-6.47513760e-10 1.17567908e-02]) = 0.00014
>24 f([-2.90771361e-10 1.02369798e-02]) = 0.00010
>25 f([-1.30573263e-10 8.91375193e-03]) = 0.00008
>26 f([-5.86349941e-11 7.76164047e-03]) = 0.00006
>27 f([-2.63305247e-11 6.75849105e-03]) = 0.00005
>28 f([-1.18239380e-11 5.88502652e-03]) = 0.00003
>29 f([-5.30963626e-12 5.12447017e-03]) = 0.00003
>30 f([-2.38433568e-12 4.46221948e-03]) = 0.00002
>31 f([-1.07070548e-12 3.88556303e-03]) = 0.00002
>32 f([-4.80809073e-13 3.38343471e-03]) = 0.00001
>33 f([-2.15911255e-13 2.94620023e-03]) = 0.00001
>34 f([-9.69567190e-14 2.56547145e-03]) = 0.00001
>35 f([-4.35392094e-14 2.23394494e-03]) = 0.00000
>36 f([-1.95516389e-14 1.94526160e-03]) = 0.00000
>37 f([-8.77982370e-15 1.69388439e-03]) = 0.00000
>38 f([-3.94265180e-15 1.47499203e-03]) = 0.00000
>39 f([-1.77048011e-15 1.28438640e-03]) = 0.00000
>40 f([-7.95048604e-16 1.11841198e-03]) = 0.00000
>41 f([-3.57023093e-16 9.73885702e-04]) = 0.00000
>42 f([-1.60324146e-16 8.48035867e-04]) = 0.00000
>43 f([-7.19948720e-17 7.38448972e-04]) = 0.00000
>44 f([-3.23298874e-17 6.43023418e-04]) = 0.00000
>45 f([-1.45180009e-17 5.59929193e-04]) = 0.00000
>46 f([-6.51942732e-18 4.87572776e-04]) = 0.00000
>47 f([-2.92760228e-18 4.24566574e-04]) = 0.00000
>48 f([-1.31466380e-18 3.69702307e-04]) = 0.00000
>49 f([-5.90360555e-19 3.21927835e-04]) = 0.00000
Done!
f([-5.90360555e-19 3.21927835e-04]) = 0.000000
```

We can plot the progress of the search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *adagrad()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

```python
# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the learning rate for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions
```

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

```python
...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)
```

We can then create a contour plot of the objective function, as before.

```python
...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
```

Finally, we can plot each solution found during the search as a white dot connected by a line.

```python
...
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
```

Tying this all together, the complete example of performing the AdaGrad optimization on the test problem and plotting the results on a contour plot is listed below.

```python
# example of plotting the adagrad search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adagrad
def adagrad(objective, derivative, bounds, n_iter, step_size):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of the sum square gradients for each variable
    sq_grad_sums = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate gradient
        gradient = derivative(solution[0], solution[1])
        # update the sum of the squared partial derivatives
        for i in range(gradient.shape[0]):
            sq_grad_sums[i] += gradient[i]**2.0
        # build solution
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the learning rate for this variable
            alpha = step_size / (1e-8 + sqrt(sq_grad_sums[i]))
            # calculate the new position in this variable
            value = solution[i] - alpha * gradient[i]
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.1
# perform the gradient descent search
solutions = adagrad(objective, derivative, bounds, n_iter, step_size)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()
```

Running the example performs the search as before, except in this case, a contour plot of the objective function is created and a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.

In this tutorial, you discovered how to develop the gradient descent with adaptive gradients optimization algorithm from scratch.

Specifically, you learned:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- Gradient descent can be updated to use an automatically adaptive step size for each input variable in the objective function, called adaptive gradients or AdaGrad.
- How to implement the AdaGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent With AdaGrad From Scratch appeared first on Machine Learning Mastery.

The post Gradient Descent Optimization With AMSGrad From Scratch appeared first on Machine Learning Mastery.

A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent like the Adaptive Movement Estimation (Adam) algorithm use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

**AMSGrad** is an extension to the Adam version of gradient descent that attempts to improve the convergence properties of the algorithm, avoiding large abrupt changes in the learning rate for each input variable.

In this tutorial, you will discover how to develop gradient descent optimization with AMSGrad from scratch.

After completing this tutorial, you will know:

- Gradient descent is an optimization algorithm that uses the gradient of the objective function to navigate the search space.
- AMSGrad is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AMSGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

This tutorial is divided into three parts; they are:

- Gradient Descent
- AMSGrad Optimization Algorithm
- Gradient Descent With AMSGrad
- Two-Dimensional Test Problem
- Gradient Descent Optimization With AMSGrad
- Visualization of AMSGrad Optimization

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function *f()* returns a score for a given set of inputs, and the derivative function *f'()* gives the derivative of the target function for a given set of inputs.

The gradient descent algorithm requires a starting point (*x*) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t) = x(t-1) – step_size * f'(x(t))

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**: Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
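To make the step size trade-off concrete, here is a minimal one-dimensional sketch (my own illustration, not code from the tutorial) applying the update rule above to f(x) = x^2, whose derivative is 2x:

```python
# minimal 1d gradient descent sketch to illustrate the step size trade-off
def gradient_descent_1d(start, step_size, n_iter):
    x = start
    for _ in range(n_iter):
        x = x - step_size * (2.0 * x)  # x(t) = x(t-1) - step_size * f'(x(t-1))
    return x

# a modest step size shrinks x by a constant factor of 0.8 each iteration
print(gradient_descent_1d(1.0, 0.1, 20))  # about 0.0115
# a step size of 1.0 flips the sign every iteration and never converges
print(gradient_descent_1d(1.0, 1.0, 20))  # back at 1.0
```

With step_size=0.1 the update is x(t) = 0.8 * x(t-1), a steady geometric decay toward the minimum; with step_size=1.0 the update is x(t) = -x(t-1), so the search bounces across the optimum forever.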

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the AMSGrad algorithm.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

The AMSGrad algorithm is an extension to the Adaptive Movement Estimation (Adam) optimization algorithm. More broadly, it is an extension to the gradient descent optimization algorithm.

The algorithm was described in the 2018 paper by Sashank J. Reddi, et al. titled “On the Convergence of Adam and Beyond.”

Generally, Adam automatically adapts a separate step size (learning rate) for each parameter in the optimization problem.

A limitation of Adam is that although it can decrease the step size as it approaches the optima, which is good, it can also increase the step size in some cases, which is bad.

AMSGrad addresses this specifically.

… ADAM aggressively increases the learning rate, however, […] this can be detrimental to the overall performance of the algorithm. […] In contrast, AMSGRAD neither increases nor decreases the learning rate and furthermore, decreases vt which can potentially lead to non-decreasing learning rate even if gradient is large in the future iterations.

— On the Convergence of Adam and Beyond, 2018.

AMSGrad is an extension to Adam that maintains a maximum of the second moment vector and uses it to normalize the update for each parameter, instead of the second moment vector itself. This helps to stop the optimization from slowing down too quickly (e.g. premature convergence).

The key difference of AMSGRAD with ADAM is that it maintains the maximum of all vt until the present time step and uses this maximum value for normalizing the running average of the gradient instead of vt in ADAM.

— On the Convergence of Adam and Beyond, 2018.

Let’s step through each element of the algorithm.

First, we must maintain a first and second moment vector as well as a max second moment vector for each parameter being optimized as part of the search, referred to as *m*, *v* (the Greek letter nu, but we will use v), and *vhat* respectively.

They are initialized to 0.0 at the start of the search.

- m = 0
- v = 0
- vhat = 0

The algorithm is executed iteratively over time t starting at *t=1*, and each iteration involves calculating a new set of parameter values *x*, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradients (partial derivatives) are calculated for the current time step.

- g(t) = f'(x(t-1))

Next, the first moment vector is updated using the gradient and a hyperparameter *beta1*.

- m(t) = beta1(t) * m(t-1) + (1 – beta1(t)) * g(t)

The *beta1* hyperparameter can be held constant or can be decayed exponentially over the course of the search, such as:

- beta1(t) = beta1^(t)

Or, alternately:

- beta1(t) = beta1 / t
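For concreteness, the first few values of each decay schedule can be computed like this (a small illustration, not code from the tutorial):

```python
# the two beta1 decay schedules side by side (illustrative values)
beta1 = 0.9
# exponential decay: beta1(t) = beta1^t
exp_decay = [beta1**t for t in range(1, 6)]
# harmonic decay: beta1(t) = beta1 / t
harmonic_decay = [beta1 / t for t in range(1, 6)]
print(exp_decay)       # 0.9, 0.81, 0.729, ...
print(harmonic_decay)  # 0.9, 0.45, 0.3, ...
```

Both schedules shrink beta1(t) toward zero over time, meaning the first moment tracks the current gradient more and more closely as the search proceeds.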

The second moment vector is updated using the square of the gradient and a hyperparameter *beta2*.

- v(t) = beta2 * v(t-1) + (1 – beta2) * g(t)^2

Next, the maximum for the second moment vector is updated.

- vhat(t) = max(vhat(t-1), v(t))

Where *max()* calculates the maximum of the provided values.

The parameter value can then be updated using the calculated terms and the step size hyperparameter *alpha*.

- x(t) = x(t-1) – alpha(t) * m(t) / sqrt(vhat(t))

Where *sqrt()* is the square root function.

The step size may also be held constant or decayed exponentially.
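The update equations above can be stepped through for a single parameter as follows (a toy walk-through of my own; for simplicity it holds beta1 constant and ignores the optional decay schedules):

```python
from math import sqrt

# one amsgrad update for a single parameter (toy sketch)
def amsgrad_step(x, g, m, v, vhat, alpha=0.002, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1.0 - beta1) * g        # first moment
    v = beta2 * v + (1.0 - beta2) * g**2     # second moment
    vhat = max(vhat, v)                      # running maximum of v
    x = x - alpha * m / (sqrt(vhat) + 1e-8)  # parameter update
    return x, m, v, vhat

# two updates starting from x=1.0 on f(x) = x^2, where f'(x) = 2x
x, m, v, vhat = 1.0, 0.0, 0.0, 0.0
for _ in range(2):
    x, m, v, vhat = amsgrad_step(x, 2.0 * x, m, v, vhat)
print(x)
```

Note that *vhat* can only grow or stay flat, so the sqrt(vhat) denominator never shrinks, which is exactly the property that prevents the effective learning rate from increasing between iterations.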

To review, there are three hyperparameters for the algorithm; they are:

- **alpha**: Initial step size (learning rate); a typical value is 0.002.
- **beta1**: Decay factor for the first moment; a typical value is 0.9.
- **beta2**: Decay factor for the second moment; a typical value is 0.999.

And that’s it.

For full derivation of the AMSGrad algorithm in the context of the Adam algorithm, I recommend reading the paper.

Next, let’s look at how we might implement the algorithm from scratch in Python.

In this section, we will explore how to implement the gradient descent optimization algorithm with AMSGrad.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this.

```python
# objective function
def objective(x, y):
    return x**2.0 + y**2.0
```

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

```python
# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

```python
# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()
```

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the AMSGrad optimization algorithm.

We can apply gradient descent with AMSGrad to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of *x^2* is *x * 2* in each dimension.

- f(x) = x^2
- f'(x) = x * 2

The *derivative()* function implements this below.

```python
# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
```

Next, we can implement gradient descent optimization with AMSGrad.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

```python
...
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
```

Next, we need to initialize the moment vectors.

```python
...
# initialize moment vectors
m = [0.0 for _ in range(bounds.shape[0])]
v = [0.0 for _ in range(bounds.shape[0])]
vhat = [0.0 for _ in range(bounds.shape[0])]
```

We then run a fixed number of iterations of the algorithm defined by the “*n_iter*” hyperparameter.

```python
...
# run iterations of gradient descent
for t in range(n_iter):
    ...
```

The first step is to calculate the derivative for the current set of parameters.

```python
...
# calculate gradient g(t)
g = derivative(x[0], x[1])
```

Next, we need to perform the AMSGrad update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

```python
...
# build a solution one variable at a time
for i in range(x.shape[0]):
    ...
```

First, we need to calculate the first moment vector.

```python
...
# m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
```

Next, we need to calculate the second moment vector.

```python
...
# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
```

Then the maximum of the second moment vector with the previous iteration and the current value.

```python
...
# vhat(t) = max(vhat(t-1), v(t))
vhat[i] = max(vhat[i], v[i])
```

Finally, we can calculate the new value for the variable.

```python
...
# x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
x[i] = x[i] - alpha * m[i] / sqrt(vhat[i])
```

We may want to add a small value to the denominator to avoid a divide by zero error; for example:

```python
...
# x(t) = x(t-1) - alpha(t) * m(t) / (sqrt(vhat(t)) + eps)
x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
```

This is then repeated for each parameter that is being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

```python
...
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))
```

We can tie all of this together into a function named *amsgrad()* that takes the names of the objective and derivative functions as well as the algorithm hyperparameters, and returns the best solution found at the end of the search and its evaluation.

```python
# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
    # generate an initial point
    x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # initialize moment vectors
    m = [0.0 for _ in range(bounds.shape[0])]
    v = [0.0 for _ in range(bounds.shape[0])]
    vhat = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for t in range(n_iter):
        # calculate gradient g(t)
        g = derivative(x[0], x[1])
        # update variables one at a time
        for i in range(x.shape[0]):
            # m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
            m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
            # v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
            v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
            # vhat(t) = max(vhat(t-1), v(t))
            vhat[i] = max(vhat[i], v[i])
            # x(t) = x(t-1) - alpha(t) * m(t) / (sqrt(vhat(t)) + eps)
            x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
        # evaluate candidate point
        score = objective(x[0], x[1])
        # report progress
        print('>%d f(%s) = %.5f' % (t, x, score))
    return [x, score]
```

We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 100 iterations with an initial learning rate of 0.007, a beta1 of 0.9, and a beta2 of 0.99, found after a little trial and error.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

At the end of the run, we will report the best solution found.

...
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of AMSGrad gradient descent applied to our test problem is listed below.

# gradient descent optimization with amsgrad for a two-dimensional test function
from math import sqrt
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vectors
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	vhat = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# update variables one at a time
		for i in range(x.shape[0]):
			# m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
			m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
			# vhat(t) = max(vhat(t-1), v(t))
			vhat[i] = max(vhat[i], v[i])
			# x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
			x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
best, score = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example applies the optimization algorithm with AMSGrad to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 90 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

...
>90 f([-5.74880707e-11 2.16227707e-03]) = 0.00000
>91 f([-4.53359947e-11 2.03974264e-03]) = 0.00000
>92 f([-3.57526928e-11 1.92415218e-03]) = 0.00000
>93 f([-2.81951584e-11 1.81511216e-03]) = 0.00000
>94 f([-2.22351711e-11 1.71225138e-03]) = 0.00000
>95 f([-1.75350316e-11 1.61521966e-03]) = 0.00000
>96 f([-1.38284262e-11 1.52368665e-03]) = 0.00000
>97 f([-1.09053366e-11 1.43734076e-03]) = 0.00000
>98 f([-8.60013947e-12 1.35588802e-03]) = 0.00000
>99 f([-6.78222208e-12 1.27905115e-03]) = 0.00000
Done!
f([-6.78222208e-12 1.27905115e-03]) = 0.000002

We can plot the progress of the AMSGrad search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the *amsgrad()* function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	solutions = list()
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vectors
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	vhat = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# update variables one at a time
		for i in range(x.shape[0]):
			# m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
			m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
			# vhat(t) = max(vhat(t-1), v(t))
			vhat[i] = max(vhat[i], v[i])
			# x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
			x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# keep track of all solutions
		solutions.append(x.copy())
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
solutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

...
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the AMSGrad optimization on the test problem and plotting the results on a contour plot is listed below.

# example of plotting the amsgrad search on a contour plot of the test function
from math import sqrt
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with amsgrad
def amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	solutions = list()
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vectors
	m = [0.0 for _ in range(bounds.shape[0])]
	v = [0.0 for _ in range(bounds.shape[0])]
	vhat = [0.0 for _ in range(bounds.shape[0])]
	# run the gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# update variables one at a time
		for i in range(x.shape[0]):
			# m(t) = beta1(t) * m(t-1) + (1 - beta1(t)) * g(t)
			m[i] = beta1**(t+1) * m[i] + (1.0 - beta1**(t+1)) * g[i]
			# v(t) = beta2 * v(t-1) + (1 - beta2) * g(t)^2
			v[i] = (beta2 * v[i]) + (1.0 - beta2) * g[i]**2
			# vhat(t) = max(vhat(t-1), v(t))
			vhat[i] = max(vhat[i], v[i])
			# x(t) = x(t-1) - alpha(t) * m(t) / sqrt(vhat(t))
			x[i] = x[i] - alpha * m[i] / (sqrt(vhat[i]) + 1e-8)
		# evaluate candidate point
		score = objective(x[0], x[1])
		# keep track of all solutions
		solutions.append(x.copy())
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 100
# step size
alpha = 0.007
# factor for average gradient
beta1 = 0.9
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with amsgrad
solutions = amsgrad(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optima and progressively getting closer to the optima at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- On the Convergence of Adam and Beyond, 2018.
- An Overview Of Gradient Descent Optimization Algorithms, 2016.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.
- Experiments with AMSGrad, 2017.

In this tutorial, you discovered how to develop gradient descent optimization with AMSGrad from scratch.

Specifically, you learned:

- AMSGrad is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AMSGrad optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent Optimization With AMSGrad From Scratch appeared first on Machine Learning Mastery.

The post Gradient Descent Optimization With AdaMax From Scratch appeared first on Machine Learning Mastery.

A limitation of gradient descent is that a single step size (learning rate) is used for all input variables. Extensions to gradient descent, like the Adaptive Movement Estimation (Adam) algorithm, use a separate step size for each input variable but may result in a step size that rapidly decreases to very small values.

**AdaMax** is an extension to the Adam version of gradient descent that generalizes the approach to the infinite norm (max) and may result in a more effective optimization on some problems.

In this tutorial, you will discover how to develop gradient descent optimization with AdaMax from scratch.

After completing this tutorial, you will know:

- AdaMax is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AdaMax optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

This tutorial is divided into three parts; they are:

- Gradient Descent
- AdaMax Optimization Algorithm
- Gradient Descent With AdaMax
- Two-Dimensional Test Problem
- Gradient Descent Optimization With AdaMax
- Visualization of AdaMax Optimization

Gradient descent is an optimization algorithm.

It is technically referred to as a first-order optimization algorithm as it explicitly makes use of the first-order derivative of the target objective function.

First-order methods rely on gradient information to help direct the search for a minimum …

— Page 69, Algorithms for Optimization, 2019.

The first-order derivative, or simply the “derivative,” is the rate of change or slope of the target function at a specific point, e.g. for a specific input.

If the target function takes multiple input variables, it is referred to as a multivariate function and the input variables can be thought of as a vector. In turn, the derivative of a multivariate target function may also be taken as a vector and is referred to generally as the gradient.

**Gradient**: First-order derivative for a multivariate objective function.

The derivative or the gradient points in the direction of the steepest ascent of the target function for a specific input.

Gradient descent refers to a minimization optimization algorithm that follows the negative of the gradient downhill of the target function to locate the minimum of the function.

The gradient descent algorithm requires a target function that is being optimized and the derivative function for the objective function. The target function f() returns a score for a given set of inputs, and the derivative function f'() gives the derivative of the target function for a given set of inputs.
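One quick way to sanity-check that a hand-written f'() really matches its f() is a central finite-difference approximation. The snippet below is an illustrative sketch of mine, not part of the tutorial code:

```python
# objective f(x) = x^2 and its hand-written derivative f'(x) = 2x
def f(x):
	return x**2.0

def f_prime(x):
	return 2.0 * x

# central finite difference: (f(x+h) - f(x-h)) / (2h) approximates f'(x)
h = 1e-6
x = 0.5
approx = (f(x + h) - f(x - h)) / (2.0 * h)
print(approx, f_prime(x))  # both approximately 1.0
```

If `approx` and `f_prime(x)` disagree by more than a tiny tolerance, the derivative function is likely wrong.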

The gradient descent algorithm requires a starting point (x) in the problem, such as a randomly selected point in the input space.

The derivative is then calculated and a step is taken in the input space that is expected to result in a downhill movement in the target function, assuming we are minimizing the target function.

A downhill movement is made by first calculating how far to move in the input space, calculated as the step size (called alpha or the learning rate) multiplied by the gradient. This is then subtracted from the current point, ensuring we move against the gradient, or down the target function.

- x(t) = x(t-1) - step_size * f'(x(t-1))

The steeper the objective function at a given point, the larger the magnitude of the gradient, and in turn, the larger the step taken in the search space. The size of the step taken is scaled using a step size hyperparameter.

**Step Size**: Hyperparameter that controls how far to move in the search space against the gradient each iteration of the algorithm.

If the step size is too small, the movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optima.
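The effect of the step size can be seen in a minimal gradient descent sketch on f(x) = x^2, whose derivative is 2x. The function and the specific values below are illustrative choices, not from the tutorial:

```python
# minimal gradient descent on f(x) = x^2 using its derivative f'(x) = 2x
def gradient_descent(derivative, x0, step_size, n_iter):
	x = x0
	for _ in range(n_iter):
		# move against the gradient
		x = x - step_size * derivative(x)
	return x

# a moderate step size converges toward the minimum at x = 0
print(gradient_descent(lambda x: 2.0 * x, 1.0, 0.1, 50))
# a step size that is too large overshoots the minimum and diverges
print(gradient_descent(lambda x: 2.0 * x, 1.0, 1.1, 50))
```

With step size 0.1 each iteration shrinks x by a factor of 0.8, while with 1.1 each iteration multiplies x by -1.2, so the search bounces across the optimum with growing magnitude.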

Now that we are familiar with the gradient descent optimization algorithm, let’s take a look at the AdaMax algorithm.

The AdaMax algorithm is an extension to the Adaptive Movement Estimation (Adam) optimization algorithm. More broadly, it is an extension to the gradient descent optimization algorithm.

The algorithm was described in the 2014 paper by Diederik Kingma and Jimmy Lei Ba titled “Adam: A Method for Stochastic Optimization.”

Adam can be understood as updating weights inversely proportional to the scaled L2 norm (squared) of past gradients. AdaMax extends this to the so-called infinite norm (max) of past gradients.

In Adam, the update rule for individual weights is to scale their gradients inversely proportional to a (scaled) L^2 norm of their individual current and past gradients

— Adam: A Method for Stochastic Optimization, 2014.

Generally, AdaMax automatically adapts a separate step size (learning rate) for each parameter in the optimization problem.

Let’s step through each element of the algorithm.

First, we must maintain a moment vector and exponentially weighted infinity norm for each parameter being optimized as part of the search, referred to as *m* and *u* respectively.

They are initialized to 0.0 at the start of the search.

- m = 0
- u = 0

The algorithm is executed iteratively over time t starting at t=1, and each iteration involves calculating a new set of parameter values x, e.g. going from *x(t-1)* to *x(t)*.

It is perhaps easy to understand the algorithm if we focus on updating one parameter, which generalizes to updating all parameters via vector operations.

First, the gradient (the vector of partial derivatives) is calculated for the current time step.

- g(t) = f'(x(t-1))

Next, the moment vector is updated using the gradient and a hyperparameter *beta1*.

- m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)

The exponentially weighted infinity norm is updated using the *beta2* hyperparameter.

- u(t) = max(beta2 * u(t-1), abs(g(t)))

Where *max()* selects the maximum of the parameters and *abs()* calculates the absolute value.

We can then update the parameter value. This can be broken down into three pieces; the first calculates the step size parameter, the second the gradient, and the third uses the step size and gradient to calculate the new parameter value.

Let’s start with calculating the step size for the parameter using an initial step size hyperparameter called *alpha* and a version of *beta1* that is decaying over time with a specific value for this time step *beta1(t)*:

- step_size(t) = alpha / (1 - beta1(t))

The gradient used for updating the parameter is calculated as follows:

- delta(t) = m(t) / u(t)

Finally, we can calculate the value for the parameter for this iteration.

- x(t) = x(t-1) - step_size(t) * delta(t)

Or the complete update equation can be stated as:

- x(t) = x(t-1) - (alpha / (1 - beta1(t))) * m(t) / u(t)
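The complete update above can be sketched for a single scalar parameter as follows. The function name *adamax_step()* and its default hyperparameter values are illustrative choices of mine:

```python
# one AdaMax update for a single scalar parameter (t starts at 1)
def adamax_step(x, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
	# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
	m = beta1 * m + (1.0 - beta1) * g
	# u(t) = max(beta2 * u(t-1), abs(g(t))), the exponentially weighted infinity norm
	u = max(beta2 * u, abs(g))
	# x(t) = x(t-1) - (alpha / (1 - beta1^t)) * m(t) / u(t)
	x = x - (alpha / (1.0 - beta1**t)) * m / u
	return x, m, u
```

For example, starting from x = 1.0 with gradient g = 2.0 and m = u = 0 at t = 1, this gives m = 0.2, u = 2.0, a step size of 0.002 / 0.1 = 0.02, and a new value x = 1.0 - 0.02 * 0.1 = 0.998.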

To review, there are three hyperparameters for the algorithm; they are:

- **alpha**: Initial step size (learning rate), a typical value is 0.002.
- **beta1**: Decay factor for first momentum, a typical value is 0.9.
- **beta2**: Decay factor for infinity norm, a typical value is 0.999.

The decay schedule for beta1(t) suggested in the paper is calculated using the initial beta1 value raised to the power t, although other decay schedules could be used such as holding the value constant or decaying more aggressively.

- beta1(t) = beta1^t
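With the typical values alpha = 0.002 and beta1 = 0.9, a quick loop (illustrative only) shows how this schedule behaves: beta1^t decays toward zero, so the step size alpha / (1 - beta1(t)) starts large and settles toward alpha:

```python
# how the beta1(t) = beta1^t schedule decays over the first few iterations
alpha, beta1 = 0.002, 0.9
for t in range(1, 6):
	step_size = alpha / (1.0 - beta1**t)
	print(t, round(beta1**t, 5), round(step_size, 5))
```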

And that’s it.

For the full derivation of the AdaMax algorithm in the context of the Adam algorithm, I recommend reading the 2014 paper “Adam: A Method for Stochastic Optimization.”

Next, let’s look at how we might implement the algorithm from scratch in Python.

In this section, we will explore how to implement the gradient descent optimization algorithm with AdaMax.

First, let’s define an optimization function.

We will use a simple two-dimensional function that squares the input of each dimension and define the range of valid inputs from -1.0 to 1.0.

The *objective()* function below implements this.

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

We can create a three-dimensional plot of the dataset to get a feeling for the curvature of the response surface.

The complete example of plotting the objective function is listed below.

# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a three-dimensional surface plot of the objective function.

We can see the familiar bowl shape with the global minima at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search.

The example below creates a contour plot of the objective function.

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

Running the example creates a two-dimensional contour plot of the objective function.

We can see the bowl shape compressed to contours shown with a color gradient. We will use this plot to plot the specific points explored during the progress of the search.

Now that we have a test objective function, let’s look at how we might implement the AdaMax optimization algorithm.

We can apply the gradient descent with AdaMax to the test problem.

First, we need a function that calculates the derivative for this function.

The derivative of x^2 is x * 2 in each dimension.

- f(x) = x^2
- f'(x) = x * 2

The *derivative()* function below implements this.

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

Next, we can implement gradient descent optimization with AdaMax.

First, we can select a random point in the bounds of the problem as a starting point for the search.

This assumes we have an array that defines the bounds of the search with one row for each dimension and the first column defines the minimum and the second column defines the maximum of the dimension.

...
# generate an initial point
x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])

Next, we need to initialize the moment vector and exponentially weighted infinity norm.

...
# initialize moment vector and weighted infinity norm
m = [0.0 for _ in range(bounds.shape[0])]
u = [0.0 for _ in range(bounds.shape[0])]

We then run a fixed number of iterations of the algorithm defined by the “*n_iter*” hyperparameter.

...
# run iterations of gradient descent
for t in range(n_iter):
	...

The first step is to calculate the derivative for the current set of parameters.

...
# calculate gradient g(t)
g = derivative(x[0], x[1])

Next, we need to perform the AdaMax update calculations. We will perform these calculations one variable at a time using an imperative programming style for readability.

In practice, I recommend using NumPy vector operations for efficiency.

...
# build a solution one variable at a time
for i in range(x.shape[0]):
	...

First, we need to calculate the moment vector.

...
# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]

Next, we need to calculate the exponentially weighted infinity norm.

...
# u(t) = max(beta2 * u(t-1), abs(g(t)))
u[i] = max(beta2 * u[i], abs(g[i]))

Then the step size used in the update.

...
# step_size(t) = alpha / (1 - beta1(t))
step_size = alpha / (1.0 - beta1**(t+1))

And the change in variable.

...
# delta(t) = m(t) / u(t)
delta = m[i] / u[i]

Finally, we can calculate the new value for the variable.

...
# x(t) = x(t-1) - step_size(t) * delta(t)
x[i] = x[i] - step_size * delta

This is then repeated for each parameter that is being optimized.

At the end of the iteration, we can evaluate the new parameter values and report the performance of the search.

...
# evaluate candidate point
score = objective(x[0], x[1])
# report progress
print('>%d f(%s) = %.5f' % (t, x, score))

We can tie all of this together into a function named *adamax()* that takes the names of the objective and derivative functions as well as the algorithm hyperparameters and returns the best solution found at the end of the search and its evaluation.

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vector and weighted infinity norm
	m = [0.0 for _ in range(bounds.shape[0])]
	u = [0.0 for _ in range(bounds.shape[0])]
	# run iterations of gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# u(t) = max(beta2 * u(t-1), abs(g(t)))
			u[i] = max(beta2 * u[i], abs(g[i]))
			# step_size(t) = alpha / (1 - beta1(t))
			step_size = alpha / (1.0 - beta1**(t+1))
			# delta(t) = m(t) / u(t)
			delta = m[i] / u[i]
			# x(t) = x(t-1) - step_size(t) * delta(t)
			x[i] = x[i] - step_size * delta
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]

We can then define the bounds of the function and the hyperparameters and call the function to perform the optimization.

In this case, we will run the algorithm for 60 iterations with an initial learning rate of 0.02, a beta1 of 0.8, and a beta2 of 0.99, found after a little trial and error.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with adamax
best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

At the end of the run, we will report the best solution found.

...
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of AdaMax gradient descent applied to our test problem is listed below.

# gradient descent optimization with adamax for a two-dimensional test function
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vector and weighted infinity norm
	m = [0.0 for _ in range(bounds.shape[0])]
	u = [0.0 for _ in range(bounds.shape[0])]
	# run iterations of gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# u(t) = max(beta2 * u(t-1), abs(g(t)))
			u[i] = max(beta2 * u[i], abs(g[i]))
			# step_size(t) = alpha / (1 - beta1(t))
			step_size = alpha / (1.0 - beta1**(t+1))
			# delta(t) = m(t) / u(t)
			delta = m[i] / u[i]
			# x(t) = x(t-1) - step_size(t) * delta(t)
			x[i] = x[i] - step_size * delta
		# evaluate candidate point
		score = objective(x[0], x[1])
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return [x, score]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with adamax
best, score = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
# summarize the result
print('Done!')
print('f(%s) = %f' % (best, score))

Running the example applies the optimization algorithm with AdaMax to our test problem and reports the performance of the search for each iteration of the algorithm.

**Note**: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and compare the average outcome.

In this case, we can see that a near-optimal solution was found after perhaps 35 iterations of the search, with input values near 0.0 and 0.0, evaluating to 0.0.

...
>33 f([-0.00122185 0.00427944]) = 0.00002
>34 f([-0.00045147 0.00289913]) = 0.00001
>35 f([0.00022176 0.00165754]) = 0.00000
>36 f([0.00073314 0.00058534]) = 0.00000
>37 f([ 0.00105092 -0.00030082]) = 0.00000
>38 f([ 0.00117382 -0.00099624]) = 0.00000
>39 f([ 0.00112512 -0.00150609]) = 0.00000
>40 f([ 0.00094497 -0.00184321]) = 0.00000
>41 f([ 0.00068206 -0.002026 ]) = 0.00000
>42 f([ 0.00038579 -0.00207647]) = 0.00000
>43 f([ 9.99977780e-05 -2.01849176e-03]) = 0.00000
>44 f([-0.00014145 -0.00187632]) = 0.00000
>45 f([-0.00031698 -0.00167338]) = 0.00000
>46 f([-0.00041753 -0.00143134]) = 0.00000
>47 f([-0.00044531 -0.00116942]) = 0.00000
>48 f([-0.00041125 -0.00090399]) = 0.00000
>49 f([-0.00033193 -0.00064834]) = 0.00000
Done!
f([-0.00033193 -0.00064834]) = 0.000001

We can plot the progress of the AdaMax search on a contour plot of the domain.

This can provide an intuition for the progress of the search over the iterations of the algorithm.

We must update the adamax() function to maintain a list of all solutions found during the search, then return this list at the end of the search.

The updated version of the function with these changes is listed below.

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	solutions = list()
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vector and weighted infinity norm
	m = [0.0 for _ in range(bounds.shape[0])]
	u = [0.0 for _ in range(bounds.shape[0])]
	# run iterations of gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# u(t) = max(beta2 * u(t-1), abs(g(t)))
			u[i] = max(beta2 * u[i], abs(g[i]))
			# step_size(t) = alpha / (1 - beta1(t))
			step_size = alpha / (1.0 - beta1**(t+1))
			# delta(t) = m(t) / u(t)
			delta = m[i] / u[i]
			# x(t) = x(t-1) - step_size(t) * delta(t)
			x[i] = x[i] - step_size * delta
		# evaluate candidate point
		score = objective(x[0], x[1])
		solutions.append(x.copy())
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return solutions

We can then execute the search as before, and this time retrieve the list of solutions instead of the best final solution.

...
# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with adamax
solutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)

We can then create a contour plot of the objective function, as before.

...
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')

Finally, we can plot each solution found during the search as a white dot connected by a line.

...
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')

Tying this all together, the complete example of performing the AdaMax optimization on the test problem and plotting the results on a contour plot is listed below.

```python
# example of plotting the adamax search on a contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot
from mpl_toolkits.mplot3d import Axes3D

# objective function
def objective(x, y):
	return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
	return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with adamax
def adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2):
	solutions = list()
	# generate an initial point
	x = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
	# initialize moment vector and weighted infinity norm
	m = [0.0 for _ in range(bounds.shape[0])]
	u = [0.0 for _ in range(bounds.shape[0])]
	# run iterations of gradient descent
	for t in range(n_iter):
		# calculate gradient g(t)
		g = derivative(x[0], x[1])
		# build a solution one variable at a time
		for i in range(x.shape[0]):
			# m(t) = beta1 * m(t-1) + (1 - beta1) * g(t)
			m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
			# u(t) = max(beta2 * u(t-1), abs(g(t)))
			u[i] = max(beta2 * u[i], abs(g[i]))
			# step_size(t) = alpha / (1 - beta1(t))
			step_size = alpha / (1.0 - beta1**(t+1))
			# delta(t) = m(t) / u(t)
			delta = m[i] / u[i]
			# x(t) = x(t-1) - step_size(t) * delta(t)
			x[i] = x[i] - step_size * delta
		# evaluate candidate point
		score = objective(x[0], x[1])
		solutions.append(x.copy())
		# report progress
		print('>%d f(%s) = %.5f' % (t, x, score))
	return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 60
# step size
alpha = 0.02
# factor for average gradient
beta1 = 0.8
# factor for average squared gradient
beta2 = 0.99
# perform the gradient descent search with adamax
solutions = adamax(objective, derivative, bounds, n_iter, alpha, beta1, beta2)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots connected by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()
```

Running the example performs the search as before, except in this case, the contour plot of the objective function is created.

In this case, we can see that a white dot is shown for each solution found during the search, starting above the optimum and progressively getting closer to the optimum at the center of the plot.

This section provides more resources on the topic if you are looking to go deeper.

- Adam: A Method for Stochastic Optimization, 2014.
- An Overview Of Gradient Descent Optimization Algorithms, 2016.

- Algorithms for Optimization, 2019.
- Deep Learning, 2016.

- Gradient descent, Wikipedia.
- Stochastic gradient descent, Wikipedia.
- An overview of gradient descent optimization algorithms, 2016.

In this tutorial, you discovered how to develop the gradient descent optimization with AdaMax from scratch.

Specifically, you learned:

- AdaMax is an extension of the Adam version of gradient descent designed to accelerate the optimization process.
- How to implement the AdaMax optimization algorithm from scratch and apply it to an objective function and evaluate the results.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post Gradient Descent Optimization With AdaMax From Scratch appeared first on Machine Learning Mastery.

The post A Gentle Introduction to Premature Convergence appeared first on Machine Learning Mastery.

It can also be a useful empirical tool when exploring the learning dynamics of an optimization algorithm, and machine learning algorithms trained using an optimization algorithm, such as deep learning neural networks. This motivates the investigation of learning curves and techniques, such as early stopping.

If optimization is a process that generates candidate solutions, then convergence represents a stable point at the end of the process when no further changes or improvements are expected. **Premature convergence** refers to a failure mode for an optimization algorithm where the process stops at a stable point that does not represent a globally optimal solution.

In this tutorial, you will discover a gentle introduction to premature convergence in machine learning.

After completing this tutorial, you will know:

- Convergence refers to the stable point found at the end of a sequence of solutions via an iterative optimization algorithm.
- Premature convergence refers to a stable point found too soon, perhaps close to the starting point of the search, and with a worse evaluation than expected.
- The greediness of an optimization algorithm provides control over its rate of convergence.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

This tutorial is divided into three parts; they are:

- Convergence in Machine Learning
- Premature Convergence
- Addressing Premature Convergence

Convergence generally refers to the values of a process that tend toward a stable or limiting behavior over time.

It is a useful idea when working with optimization algorithms.

Optimization refers to a type of problem that requires finding a set of inputs that result in the maximum or minimum value from an objective function. Optimization is an iterative process that produces a sequence of candidate solutions until ultimately arriving upon a final solution at the end of the process.

This behavior or dynamics of the optimization algorithm arriving at a stable-point final solution is referred to as convergence, e.g. the convergence of the optimization algorithm. In this way, convergence defines the termination of the optimization algorithm.

Local descent involves iteratively choosing a descent direction and then taking a step in that direction and repeating that process until convergence or some termination condition is met.

— Page 13, Algorithms for Optimization, 2019.

**Convergence**: Stop condition for an optimization algorithm where a stable point is located and further iterations of the algorithm are unlikely to result in further improvement.

We might measure and explore the convergence of an optimization algorithm empirically, such as using learning curves. Additionally, we might also explore the convergence of an optimization algorithm analytically, such as a convergence proof or average case computational complexity.

Strong selection pressure results in rapid, but possibly premature, convergence. Weakening the selection pressure slows down the search process …

— Page 78, Evolutionary Computation: A Unified Approach, 2002.

Optimization, and the convergence of optimization algorithms, is an important concept in machine learning for those algorithms that fit (learn) on a training dataset via an iterative optimization algorithm, such as logistic regression and artificial neural networks.

As such, we may choose optimization algorithms that result in better convergence behavior than other algorithms, or spend a lot of time tuning the convergence dynamics (learning dynamics) of an optimization algorithm via the hyperparameters of the optimization (e.g. learning rate).

Convergence behavior can be compared across algorithms, often in terms of the number of iterations required until convergence, the objective function evaluation of the stable point found at convergence, and combinations of these concerns.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Premature convergence refers to the convergence of a process that has occurred too soon.

In optimization, it refers to the algorithm converging upon a stable point that has worse performance than expected.

Premature convergence typically afflicts complex optimization tasks where the objective function is non-convex, meaning that the response surface contains many different good solutions (stable points), perhaps with one (or a few) best solutions.

If we consider the response surface of an objective function under optimization as a geometrical landscape and we are seeking a minimum of the function, then premature convergence refers to finding a valley close to the starting point of the search that has less depth than the deepest valley in the problem domain.

For problems that exhibit highly multi-modal (rugged) fitness landscapes or landscapes that change over time, too much exploitation generally results in premature convergence to suboptimal peaks in the space.

— Page 60, Evolutionary Computation: A Unified Approach, 2002.

In this way, premature convergence is described as finding a locally optimal solution instead of the globally optimal solution for an optimization algorithm. It is a specific failure case for an optimization algorithm.

**Premature Convergence**: Convergence of an optimization algorithm to a worse than optimal stable point that is likely close to the starting point.
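This failure case can be illustrated with a toy sketch (the function, step size, and starting points below are invented for this example, not taken from the post): gradient descent on a simple multimodal function converges to whichever valley is nearest its starting point, and from one side of the space that valley is the worse, local one.

```python
# toy illustration of premature convergence: gradient descent trapped in a
# local minimum of a multimodal function (example function is made up)
# f(x) = x^4 - 2x^2 + 0.2x has a global minimum near x = -1.02
# and a worse, local minimum near x = 0.97

def objective(x):
	return x**4 - 2.0*x**2 + 0.2*x

def derivative(x):
	return 4.0*x**3 - 4.0*x + 0.2

def gradient_descent(x, alpha=0.01, n_iter=500):
	# repeatedly step downhill from the starting point x
	for _ in range(n_iter):
		x = x - alpha * derivative(x)
	return x

# starting to the right of the origin converges prematurely to the local minimum
x_local = gradient_descent(2.0)
# starting to the left converges to the global minimum
x_global = gradient_descent(-2.0)
print('from  2.0 -> x=%.3f f=%.3f' % (x_local, objective(x_local)))
print('from -2.0 -> x=%.3f f=%.3f' % (x_global, objective(x_global)))
```

Both runs reach a stable point, but only one of them is the global optimum; the other stops at the nearby valley with a worse evaluation.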

Put another way, convergence signifies the end of the search process, e.g. a stable point was located and further iterations of the algorithm are not likely to improve upon the solution. Premature convergence refers to reaching this stop condition of an optimization algorithm at a less than desirable stationary point.

Premature convergence may be a relevant concern on any reasonably challenging optimization task.

For example, a majority of research into the field of evolutionary computation and genetic algorithms involves identifying and overcoming the premature convergence of the algorithm on an optimization task.

If selection focuses on the most-fit individuals, the selection pressure may cause premature convergence due to reduced diversity of the new populations.

— Page 139, Computational Intelligence: An Introduction, 2nd edition, 2007.

Population-based optimization algorithms, like evolutionary algorithms and swarm intelligence, often describe their dynamics in terms of the interplay between selective pressures and convergence. For example, strong selective pressures result in faster convergence and likely premature convergence. Weaker selective pressures may result in a slower convergence (greater computational cost) although perhaps locate a better or even global optima.

An operator with a high selective pressure decreases diversity in the population more rapidly than operators with a low selective pressure, which may lead to premature convergence to suboptimal solutions. A high selective pressure limits the exploration abilities of the population.

— Page 135, Computational Intelligence: An Introduction, 2nd edition, 2007.

This idea of selective pressure is helpful more generally in understanding the learning dynamics of optimization algorithms. For example, an optimization that is configured to be too greedy (e.g. via hyperparameters such as the step size or learning rate) may fail due to premature convergence, whereas the same algorithm that is configured to be less greedy may overcome premature convergence and discover a better or globally optimal solution.

Premature convergence may be encountered when using stochastic gradient descent to train a neural network model, signified by a learning curve that drops exponentially quickly then stops improving.

The number of updates required to reach convergence usually increases with training set size. However, as m approaches infinity, the model will eventually converge to its best possible test error before SGD has sampled every example in the training set.

— Page 153, Deep Learning, 2016.

The fact that fitting neural networks is subject to premature convergence motivates the use of methods such as learning curves to monitor and diagnose issues with the convergence of a model on a training dataset, and the use of regularization, such as early stopping, that deliberately halts the optimization algorithm before it reaches a stable point whose further pursuit would come at the expense of worse performance on a holdout dataset.
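A minimal sketch of the early stopping idea (the one-weight linear model, data, and patience value are invented for illustration): gradient descent on a training set is halted once the error on a hold-out validation set stops improving.

```python
# sketch of early stopping: halt gradient descent when validation error
# stops improving (hypothetical example, not from the book)
import numpy as np

rng = np.random.default_rng(1)

# synthetic data: y = 3x + noise, split into train and validation sets
X = rng.uniform(-1, 1, 200)
y = 3.0 * X + rng.normal(0, 0.5, 200)
X_tr, y_tr, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

def mse(w, X, y):
	return np.mean((w * X - y) ** 2)

w = 0.0          # initial weight
alpha = 0.1      # step size
patience = 10    # iterations to wait for a validation improvement
best_val, best_w, wait = float('inf'), w, 0
for t in range(1000):
	# gradient of the training MSE with respect to w
	grad = np.mean(2.0 * (w * X_tr - y_tr) * X_tr)
	w = w - alpha * grad
	val = mse(w, X_val, y_val)
	if val < best_val:
		best_val, best_w, wait = val, w, 0
	else:
		wait += 1
		if wait >= patience:
			# halt before the training stable point is reached
			break
print('stopped at t=%d, w=%.3f' % (t, best_w))
```

The weight kept is the one that scored best on the hold-out data, not the one the training loss alone would eventually converge to.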

As such, much research into deep learning neural networks is ultimately directed at overcoming premature convergence.

Empirically, it is often found that ‘tanh’ activation functions give rise to faster convergence of training algorithms than logistic functions.

— Page 127, Neural Networks for Pattern Recognition, 1995.

This includes techniques such as work on weight initialization, which is critical because the initial weights of a neural network define the starting point of the optimization process, and poor initialization can lead to premature convergence.

The initial point can determine whether the algorithm converges at all, with some initial points being so unstable that the algorithm encounters numerical difficulties and fails altogether.

— Page 301, Deep Learning, 2016.

This also includes the vast number of variations and extensions of the stochastic gradient descent optimization algorithm, such as the addition of momentum so that the algorithm does not overshoot the optima (stable point), and Adam that adds an automatically adapted step size hyperparameter (learning rate) for each parameter that is being optimized, dramatically speeding up convergence.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- An Introduction to Genetic Algorithms, 1998.
- Computational Intelligence: An Introduction, 2nd edition, 2007.
- Deep Learning, 2016.
- Evolutionary Computation: A Unified Approach, 2002.
- Neural Networks for Pattern Recognition, 1995.
- Probabilistic Machine Learning: An Introduction, 2020.

- Limit of a sequence, Wikipedia.
- Convergence of random variables, Wikipedia.
- Premature convergence, Wikipedia.

In this tutorial, you discovered a gentle introduction to premature convergence in machine learning.

Specifically, you learned:

- Convergence refers to the stable point found at the end of a sequence of solutions via an iterative optimization algorithm.
- Premature convergence refers to a stable point found too soon, perhaps close to the starting point of the search, and with a worse evaluation than expected.
- The greediness of an optimization algorithm provides control over its rate of convergence.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post Why Optimization Is Important in Machine Learning appeared first on Machine Learning Mastery.

This problem can be described as approximating a function that maps examples of inputs to examples of outputs. Approximating a function can be solved by framing the problem as function optimization. This is where a machine learning algorithm defines a parameterized mapping function (e.g. a weighted sum of inputs) and an optimization algorithm is used to find the values of the parameters (e.g. model coefficients) that minimize the error of the function when used to map inputs to outputs.

This means that each time we fit a machine learning algorithm on a training dataset, we are solving an optimization problem.

In this tutorial, you will discover the central role of optimization in machine learning.

After completing this tutorial, you will know:

- Machine learning algorithms perform function approximation, which is solved using function optimization.
- Function optimization is the reason why we minimize error, cost, or loss when fitting a machine learning algorithm.
- Optimization is also performed during data preparation, hyperparameter tuning, and model selection in a predictive modeling project.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

This tutorial is divided into three parts; they are:

- Machine Learning and Optimization
- Learning as Optimization
- Optimization in a Machine Learning Project
- Data Preparation as Optimization
- Hyperparameter Tuning as Optimization
- Model Selection as Optimization

Function optimization is the problem of finding the set of inputs to a target objective function that result in the minimum or maximum of the function.

It can be a challenging problem as the function may have tens, hundreds, thousands, or even millions of inputs, and the structure of the function is unknown, and often non-differentiable and noisy.

**Function Optimization**: Find the set of inputs that results in the minimum or maximum of an objective function.

Machine learning can be described as function approximation. That is, approximating the unknown underlying function that maps examples of inputs to outputs in order to make predictions on new data.

It can be challenging as there is often a limited number of examples from which we can approximate the function, and the structure of the function that is being approximated is often nonlinear, noisy, and may even contain contradictions.

**Function Approximation**: Generalize from specific examples to a reusable mapping function for making predictions on new examples.

Function optimization is often simpler than function approximation.

Importantly, in machine learning, we often solve the problem of function approximation using function optimization.

At the core of nearly all machine learning algorithms is an optimization algorithm.

In addition, the process of working through a predictive modeling problem involves optimization at multiple steps in addition to learning a model, including:

- Choosing the hyperparameters of a model.
- Choosing the transforms to apply to the data prior to modeling.
- Choosing the modeling pipeline to use as the final model.

Now that we know that optimization plays a central role in machine learning, let’s look at some examples of learning algorithms and how they use optimization.

Take my free 7-day email crash course now (with sample code).

Click to sign-up and also get a free PDF Ebook version of the course.

Predictive modeling problems involve making a prediction from an example of input.

A numeric quantity must be predicted in the case of a regression problem, whereas a class label must be predicted in the case of a classification problem.

The problem of predictive modeling is sufficiently challenging that we cannot write code to make predictions. Instead, we must use a learning algorithm applied to historical data to learn a “*program*” called a predictive model that we can use to make predictions on new data.

In statistical learning, a statistical perspective on machine learning, the problem is framed as the learning of a mapping function (f) given examples of input data (*X*) and associated output data (*y*).

- y = f(X)

Given new examples of input (*Xhat*), we must map each example onto the expected output value (*yhat*) using our learned function (*fhat*).

- yhat = fhat(Xhat)

The learned mapping will be imperfect. No model is perfect, and some prediction error is expected given the difficulty of the problem, noise in the observed data, and the choice of learning algorithm.

Mathematically, learning algorithms solve the problem of approximating the mapping function by solving a function optimization problem.

Specifically, given examples of inputs and outputs, find the set of inputs to the mapping function that results in the minimum loss, minimum cost, or minimum prediction error.

The more biased or constrained the choice of mapping function, the easier the optimization is to solve.

Let’s look at some examples to make this clear.

A linear regression (for regression problems) is a highly constrained model and can be solved analytically using linear algebra. The inputs to the mapping function are the coefficients of the model.

We can use an optimization algorithm, like a quasi-Newton local search algorithm, but it will almost always be less efficient than the analytical solution.

**Linear Regression**: Function inputs are model coefficients; an optimization problem that can be solved analytically.
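As a rough sketch of this point (the example data below is invented), the least-squares coefficients can be recovered directly with linear algebra, e.g. NumPy's least-squares solver, with no iterative search at all:

```python
# fit linear regression analytically with least squares (illustrative data)
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, (100, 2))
# true mapping: y = 1.0 + 2.0*x1 - 3.0*x2 + noise
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.1, 100)

# add a bias column and solve the least-squares problem analytically
A = np.hstack([np.ones((100, 1)), X])
coef, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
print(coef)  # approximately [1.0, 2.0, -3.0]
```

The "optimization" here is performed in closed form by the solver, which is why an iterative algorithm is rarely worth the trouble for this model.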

A logistic regression (for classification problems) is slightly less constrained and must be solved as an optimization problem, although something about the structure of the optimization function being solved is known given the constraints imposed by the model.

This means a local search algorithm like a quasi-Newton method can be used. We could use a global search like stochastic gradient descent, but it will almost always be less efficient.

**Logistic Regression**: Function inputs are model coefficients; an optimization problem that requires an iterative local search algorithm.
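A hedged sketch of this idea (the data is synthetic, and real implementations add regularization and supply analytic gradients): the logistic loss can be handed to SciPy's quasi-Newton BFGS solver, with the model coefficients as the function inputs.

```python
# fit logistic regression coefficients with a quasi-Newton method (BFGS)
# illustrative sketch on made-up data
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(0, 1, (200, 2))
# true decision boundary: 1.5*x1 - 1.0*x2 = 0
p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1])))
y = (rng.uniform(0, 1, 200) < p).astype(float)

def neg_log_likelihood(w):
	# logistic loss: sum of log(1 + exp(z)) - y*z, with z = Xw
	z = X @ w
	return np.sum(np.logaddexp(0.0, z) - y * z)

# iterative local search from an initial point of all zeros
result = minimize(neg_log_likelihood, x0=np.zeros(2), method='BFGS')
print(result.x)
```

The loss is convex, so the local search is enough; the recovered coefficients land near the true values that generated the labels.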

A neural network model is a very flexible learning algorithm that imposes few constraints. The inputs to the mapping function are the network weights. A local search algorithm cannot be used given the search space is multimodal and highly nonlinear; instead, a global search algorithm must be used.

A global optimization algorithm is commonly used, specifically stochastic gradient descent, and the updates are made in a way that is aware of the structure of the model (backpropagation and the chain rule). We could use a global search algorithm that is oblivious of the structure of the model, like a genetic algorithm, but it will almost always be less efficient.

**Neural Network**: Function inputs are model weights; an optimization problem that requires an iterative global search algorithm.

We can see that each algorithm makes different assumptions about the form of the mapping function, which influences the type of optimization problem to be solved.

We can also see that the default optimization algorithm used for each machine learning algorithm is not arbitrary; it represents the most efficient algorithm for solving the specific optimization problem framed by the algorithm, e.g. stochastic gradient descent for neural nets instead of a genetic algorithm. Deviating from these defaults requires a good reason.

Not all machine learning algorithms solve an optimization problem. A notable example is the k-nearest neighbors algorithm that stores the training dataset and does a lookup for the k best matches to each new example in order to make a prediction.

Now that we are familiar with learning in machine learning algorithms as optimization, let’s look at some related examples of optimization in a machine learning project.

Optimization plays an important part in a machine learning project in addition to fitting the learning algorithm on the training dataset.

The step of preparing the data prior to fitting the model and the step of tuning a chosen model also can be framed as an optimization problem. In fact, an entire predictive modeling project can be thought of as one large optimization problem.

Let’s take a closer look at each of these cases in turn.

Data preparation involves transforming raw data into a form that is most appropriate for the learning algorithms.

This might involve scaling values, handling missing values, and changing the probability distribution of variables.

Transforms can be made to change the representation of the historical data to meet the expectations or requirements of specific learning algorithms. Yet, sometimes good or best results can be achieved when those expectations are violated or when a seemingly unrelated transform of the data is performed.

We can think of choosing transforms to apply to the training data as a search or optimization problem of best exposing the unknown underlying structure of the data to the learning algorithm.

**Data Preparation**: Function inputs are sequences of transforms; an optimization problem that requires an iterative global search algorithm.

This optimization problem is often performed manually with human-based trial and error. Nevertheless, it is possible to automate this task using a global optimization algorithm where the inputs to the function are the types and order of transforms applied to the training data.

The number and permutations of data transforms are typically quite limited and it may be possible to perform an exhaustive search or a grid search of commonly used sequences.
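A toy sketch of such an exhaustive search (the candidate "transforms" here are simply polynomial feature degrees, and the data is invented): each candidate transform is applied, a model is fit, and the transform with the best hold-out error wins.

```python
# toy exhaustive search over data transforms (polynomial feature degrees),
# scored on a hold-out set; data and candidates are illustrative
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-2, 2, 100)
y = x**2 + rng.normal(0, 0.1, 100)  # quadratic truth plus noise
x_tr, y_tr, x_val, y_val = x[:70], y[:70], x[70:], y[70:]

errors = {}
for degree in [1, 2, 3]:
	# the "transform" is expanding the input into a polynomial basis
	coef = np.polyfit(x_tr, y_tr, degree)
	pred = np.polyval(coef, x_val)
	errors[degree] = np.mean((pred - y_val) ** 2)

best = min(errors, key=errors.get)
print(errors, 'best degree:', best)
```

On quadratic data, the linear transform scores far worse on the hold-out set than the quadratic one, so the search recovers a sensible preparation choice automatically.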

For more on this topic, see the tutorial:

Machine learning algorithms have hyperparameters that can be configured to tailor the algorithm to a specific dataset.

Although the dynamics of many hyperparameters are known, the specific effect they will have on the performance of the resulting model on a given dataset is not known. As such, it is a standard practice to test a suite of values for key algorithm hyperparameters for a chosen machine learning algorithm.

This is called hyperparameter tuning or hyperparameter optimization.

It is common to use a naive optimization algorithm for this purpose, such as a random search algorithm or a grid search algorithm.

**Hyperparameter Tuning**: Function inputs are algorithm hyperparameters; an optimization problem that requires an iterative global search algorithm.
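A minimal sketch of random search for a hyperparameter (the hyperparameter, objective, and ranges below are invented for illustration): candidate step sizes for gradient descent are sampled at random, and the one that yields the lowest final cost on a bowl-shaped objective is kept.

```python
# random search over a hyperparameter: the gradient descent step size
# on the bowl function f(x) = x^2 (illustrative values)
from random import uniform, seed

def objective(x):
	return x**2.0

def gradient_descent(alpha, n_iter=20, x0=3.0):
	# run a short gradient descent with the candidate step size
	x = x0
	for _ in range(n_iter):
		x = x - alpha * 2.0 * x  # derivative of x^2 is 2x
	return objective(x)

seed(1)
best_alpha, best_score = None, float('inf')
for _ in range(30):
	alpha = uniform(0.0, 1.0)  # sample a candidate step size
	score = gradient_descent(alpha)
	if score < best_score:
		best_alpha, best_score = alpha, score
print('best alpha=%.3f, final f(x)=%.6f' % (best_alpha, best_score))
```

Each evaluation is itself a full optimization run, which is exactly why tuning is framed as an (expensive) outer optimization problem.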

For more on this topic, see the tutorial:

Nevertheless, it is becoming increasingly common to use an iterative global search algorithm for this optimization problem. A popular choice is a Bayesian optimization algorithm that is capable of simultaneously approximating the target function that is being optimized (using a surrogate function) while optimizing it.

This is desirable as evaluating a single combination of model hyperparameters is expensive, requiring fitting the model on the entire training dataset one or many times, depending on the choice of model evaluation procedure (e.g. repeated k-fold cross-validation).

For more on this topic, see the tutorial:

Model selection involves choosing one from among many candidate machine learning models for a predictive modeling problem.

Really, it involves choosing the machine learning algorithm or machine learning pipeline that produces a model. This is then used to train a final model that can then be used in the desired application to make predictions on new data.

This process of model selection is often a manual process performed by a machine learning practitioner involving tasks such as preparing data, evaluating candidate models, tuning well-performing models, and finally choosing the final model.

This can be framed as an optimization problem that subsumes part of or the entire predictive modeling project.

**Model Selection**: Function inputs are the data transforms, machine learning algorithm, and algorithm hyperparameters; an optimization problem that requires an iterative global search algorithm.

Increasingly, this is the case with automated machine learning (AutoML) algorithms being used to choose an algorithm, an algorithm and hyperparameters, or data preparation, algorithm and hyperparameters, with very little user intervention.

For more on AutoML see the tutorial:

Like hyperparameter tuning, it is common to use a global search algorithm that also approximates the objective function, such as Bayesian optimization, given that each function evaluation is expensive.

This automated optimization approach to machine learning also underlies modern machine learning as a service (MLaaS) products provided by companies such as Google, Microsoft, and Amazon.

Although fast and efficient, such approaches are still unable to outperform hand-crafted models prepared by highly skilled experts, such as those participating in machine learning competitions.

This section provides more resources on the topic if you are looking to go deeper.

- A Gentle Introduction to Applied Machine Learning as a Search Problem
- How to Grid Search Data Preparation Techniques
- Hyperparameter Optimization With Random Search and Grid Search
- How to Implement Bayesian Optimization from Scratch in Python
- Automated Machine Learning (AutoML) Libraries for Python

- Mathematical optimization, Wikipedia.
- Function approximation, Wikipedia.
- Least-squares function approximation, Wikipedia.
- Hyperparameter optimization, Wikipedia.
- Model selection, Wikipedia.

In this tutorial, you discovered the central role of optimization in machine learning.

Specifically, you learned:

- Machine learning algorithms perform function approximation, which is solved using function optimization.
- Function optimization is the reason why we minimize error, cost, or loss when fitting a machine learning algorithm.
- Optimization is also performed during data preparation, hyperparameter tuning, and model selection in a predictive modeling project.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.


The post A Gentle Introduction to Function Optimization appeared first on Machine Learning Mastery.

Importantly, function optimization is central to almost all machine learning algorithms, and predictive modeling projects. As such, it is critical to understand what function optimization is, the terminology used in the field, and the elements that constitute a function optimization problem.

In this tutorial, you will discover a gentle introduction to function optimization.

After completing this tutorial, you will know:

- The three elements of function optimization: candidate solutions, objective functions, and cost.
- The conceptualization of function optimization as navigating a search space and response surface.
- The difference between global optima and local optima when solving a function optimization problem.

**Kick-start your project** with my new book Optimization for Machine Learning, including *step-by-step tutorials* and the *Python source code* files for all examples.

This tutorial is divided into four parts; they are:

- Function Optimization
- Candidate Solutions
- Objective Functions
- Evaluation Costs

Function optimization is a subfield of mathematics, and in modern times is addressed using numerical computing methods.

**Continuous function optimization** (“*function optimization*” here for short) belongs to a broader field of study called mathematical optimization.

It is distinct from other types of optimization as it involves finding optimal candidate solutions composed of numeric input variables, as opposed to candidate solutions composed of sequences or combinations (e.g. combinatorial optimization).

Function optimization provides a widely used toolbox of techniques employed in practically all scientific and engineering disciplines.

People optimize. Investors seek to create portfolios that avoid excessive risk while achieving a high rate of return. […] Optimization is an important tool in decision science and in the analysis of physical systems.

— Page 2, Numerical Optimization, 2006.

It plays a central role in machine learning, as almost all machine learning algorithms use function optimization to fit a model to a training dataset.

For example, fitting a line to a collection of points requires solving an optimization problem. As does fitting a linear regression or a neural network model on a training dataset.

In this way, optimization provides a tool to adapt a general model to a specific situation. Learning is treated as an optimization or search problem.

Practically, function optimization describes a class of problems for finding the input to a given function that results in the minimum or maximum output from the function.

The objective depends on certain characteristics of the system, called variables or unknowns. Our goal is to find values of the variables that optimize the objective.

— Page 2, Numerical Optimization, 2006.

Function Optimization involves three elements: the input to the function (e.g. *x*), the objective function itself (e.g. *f()*) and the output from the function (e.g. *cost*).

- **Input (x)**: The input to the function to be evaluated, e.g. a candidate solution.
- **Function (f())**: The objective function or target function that evaluates inputs.
- **Cost**: The result of evaluating a candidate solution with the objective function, minimized or maximized.
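These three elements can be sketched in a few lines of Python, using a hypothetical one-variable objective function f(x) = x^2 (not a function from this tutorial):

```python
# the objective (target) function that evaluates inputs
def objective(x):
    return x ** 2.0

x = 2.5              # input: a candidate solution
cost = objective(x)  # output: the cost of the candidate
print(cost)          # 6.25
```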

Let’s take a closer look at each element in turn.


A candidate solution is a single input to the objective function.

The form of a candidate solution depends on the specifics of the objective function. It may be a single floating point number, a vector of numbers, a matrix of numbers, or as complex as needed for the specific problem domain.

Most commonly, candidate solutions are vectors of numbers. For a test problem, the vector represents the specific values of each input variable to the function (*x = x1, x2, x3, …, xn*). For a machine learning model, the vector may represent model coefficients or weights.
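As a sketch, a candidate solution can be represented as a NumPy vector and passed to a hypothetical vector objective function (here, the sum of squared components, which has its minimum at the zero vector):

```python
import numpy as np

# hypothetical objective over a vector candidate x = (x1, ..., xn)
def objective(x):
    return float(np.sum(x ** 2))

# a candidate solution represented as a vector of numbers
candidate = np.array([1.0, -2.0, 0.5])
print(objective(candidate))  # 5.25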

Mathematically speaking, optimization is the minimization or maximization of a function subject to constraints on its variables.

— Page 2, Numerical Optimization, 2006.

There may be constraints imposed by the problem domain or the objective function on the candidate solutions. This might include aspects such as:

- The number of variables (1, 20, 1,000,000, etc.)
- The data type of variables (integer, binary, real-valued, etc.)
- The range of accepted values (between 0 and 1, etc.)
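Such constraints can be checked or enforced in code. The sketch below assumes a hypothetical two-variable problem with each real-valued input accepted only in the range [0, 1]:

```python
import numpy as np

# hypothetical bounds: two variables, each accepted in [0, 1]
bounds = np.array([[0.0, 1.0], [0.0, 1.0]])

def is_feasible(x, bounds):
    # check that every variable lies within its accepted range
    return bool(np.all(x >= bounds[:, 0]) and np.all(x <= bounds[:, 1]))

def repair(x, bounds):
    # clip an infeasible candidate back into the accepted ranges
    return np.clip(x, bounds[:, 0], bounds[:, 1])

x = np.array([1.3, -0.2])  # an infeasible candidate
print(is_feasible(x, bounds))  # False
print(repair(x, bounds))       # [1. 0.]
```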

Importantly, candidate solutions are distinct points in the search space, and there are many of them.

The universe of candidate solutions may be vast, too large to enumerate. Instead, the best we can do is sample candidate solutions in the search space. As a practitioner, we seek an optimization algorithm that makes the best use of the information available about the problem to effectively sample the search space and locate a good or best candidate solution.

**Search Space**: Universe of candidate solutions defined by the number, type, and range of accepted inputs to the objective function.

Finally, candidate solutions can be rank-ordered based on their evaluation by the objective function, meaning that some are better than others.
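Sampling the search space can be sketched with a naive random search, assuming a hypothetical sum-of-squares objective over two variables, each in [-5, 5]; the best-sampled candidate stands in for a "good" solution:

```python
import numpy as np

# hypothetical objective: sum of squares, global minimum at the origin
def objective(x):
    return float(np.sum(x ** 2))

# search space: 2 variables, each accepted in [-5, 5]
lo, hi, n_vars = -5.0, 5.0, 2
rng = np.random.default_rng(1)

# naive random search: sample candidates and keep the best so far
best_x, best_cost = None, float("inf")
for _ in range(1000):
    x = rng.uniform(lo, hi, n_vars)
    cost = objective(x)
    if cost < best_cost:
        best_x, best_cost = x, cost

print(best_x, best_cost)
```

The candidates are implicitly rank-ordered here: each new sample is compared against the best cost seen so far.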

The objective function is specific to the problem domain.

It may be a test function, e.g. a well-known equation with a specific number of input variables, the calculation of which returns the cost of the input. The optima of test functions are known, allowing algorithms to be compared based on their ability to navigate the search space efficiently.
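As a sketch, the one-dimensional Rastrigin function is a well-known test function whose global minimum is known to be f(0) = 0, so the result of an algorithm can be compared against the true optimum:

```python
import numpy as np

# one-dimensional Rastrigin test function (known global minimum at x = 0)
def rastrigin(x):
    return 10.0 + x ** 2 - 10.0 * np.cos(2.0 * np.pi * x)

print(rastrigin(0.0))  # 0.0, the known global optimum
print(rastrigin(1.0))  # ~1.0, a local optimum
```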

In machine learning, the objective function may involve plugging the candidate solution into a model and evaluating it against a portion of the training dataset, and the cost may be an error score, often called the loss of the model.

The objective function is often easy to define but expensive to evaluate. For this reason, efficiency in function optimization is measured by the total number of function evaluations required.
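One way to track this cost is to count evaluations directly. The sketch below wraps a hypothetical one-variable objective with a counter and minimizes it with SciPy's Nelder-Mead method:

```python
from scipy.optimize import minimize

# count how often the (assumed expensive) objective is evaluated
n_evals = 0

def objective(x):
    global n_evals
    n_evals += 1
    return (x[0] - 2.0) ** 2  # hypothetical objective, minimum at x = 2

result = minimize(objective, x0=[0.0], method="Nelder-Mead")
print(result.x)  # near 2.0
print(n_evals)   # total evaluations consumed by the search
```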

Although the objective function is easy to define, it may be challenging to optimize. The difficulty of an objective function may range from being able to analytically solve the function directly using calculus or linear algebra (easy), to using a local search algorithm (moderate), to using a global search algorithm (hard).

The difficulty of an objective function is based on how much is known about the function. This often cannot be determined by simply reviewing the equation or code for evaluating candidate solutions. Instead, it refers to the structure of the response surface.

The response surface (or search landscape) is the geometrical structure of the cost in relation to the search space of candidate solutions. For example, a smooth response surface suggests that small changes to the input (candidate solutions) result in small changes to the output (cost) from the objective function.

**Response Surface**: Geometrical properties of the cost from the objective function in response to changes to the candidate solutions.

The response surface can be visualized in low dimensions, e.g. for candidate solutions with one or two input variables. A one-dimensional input can be plotted as a 2D scatter plot with input values on the x-axis and the cost on the y-axis. A two-dimensional input can be plotted as a 3D surface plot with input variables on the x and y-axis, and the height of the surface representing the cost.
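Such a visualization can be sketched with NumPy and matplotlib (assuming matplotlib is installed), here sampling the response surface of a hypothetical one-variable quadratic objective:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# sample a hypothetical 1-D response surface f(x) = x^2
x = np.linspace(-5.0, 5.0, 101)  # input values across the search space
y = x ** 2                       # cost of each candidate

# plot input values on the x-axis and cost on the y-axis
plt.plot(x, y)
plt.xlabel("input (candidate solution)")
plt.ylabel("cost")
plt.savefig("response_surface.png")
```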

In a minimization problem, poor solutions are represented as hills in the response surface and good solutions as valleys. This is inverted for maximization problems.

The structure and shape of this response surface determine the difficulty an algorithm will have in navigating the search space to a solution.

The complexity of real objective functions means we cannot analyze the surface analytically, and the high dimensionality of the inputs and computational cost of function evaluations makes mapping and plotting real objective functions intractable.

The cost of a candidate solution is almost always a single real-valued number.

The scale of the cost values will vary depending on the specifics of the objective function. In general, the only meaningful comparison of cost values is to other cost values calculated by the same objective function.

The minimum or maximum output from the function is called the optimum of the function; by convention, we usually speak in terms of the minimum. Any function we wish to maximize can be converted to a minimization problem by negating the cost returned from the function.
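This conversion can be sketched with a hypothetical function whose maximum is at x = 1:

```python
# hypothetical concave function we wish to maximize (maximum at x = 1)
def to_maximize(x):
    return -(x - 1.0) ** 2 + 3.0

# minimizing the negated cost is equivalent to maximizing the original
def to_minimize(x):
    return -to_maximize(x)

print(to_maximize(1.0))  # 3.0, the maximum of the original function
print(to_minimize(1.0))  # -3.0, the corresponding minimum
```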

In global optimization, the true global solution of the optimization problem is found; the compromise is efficiency. The worst-case complexity of global optimization methods grows exponentially with the problem sizes …

— Page 10, Convex Optimization, 2004.

An objective function may have a single best solution, referred to as the global optimum of the objective function. Alternatively, the objective function may have many global optima, in which case we may be interested in locating one or all of them.

Many numerical optimization methods seek local minima. Local minima are locally optimal, but we do not generally know whether a local minimum is a global minimum.

— Page 8, Algorithms for Optimization, 2019.

In addition to the global optima, a function may have local optima: good candidate solutions that may be relatively easy to locate, but not as good as the global optima. Local optima may appear to a search algorithm to be global optima, e.g. each may sit in its own valley of the response surface, in which case we might refer to them as deceptive, as the algorithm will easily locate them and get stuck, failing to locate the global optima.

- **Global Optima**: The candidate solution with the best cost from the objective function.
- **Local Optima**: Candidate solutions that are good, but not as good as the global optima.

The relative nature of cost values means that a baseline in performance on challenging problems can be established using a naive search algorithm (e.g. random search), and the "goodness" of solutions found by more sophisticated search algorithms can be compared against that baseline.
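The danger of deceptive local optima can be sketched with SciPy on the one-dimensional Rastrigin function, which has many local minima: a local search started from a poor point will likely get stuck, while a global search (here, differential evolution) locates the global optimum near x = 0:

```python
import numpy as np
from scipy.optimize import minimize, differential_evolution

# 1-D Rastrigin function: global minimum f(0) = 0,
# surrounded by many deceptive local minima
def rastrigin(x):
    x = np.asarray(x)
    return float(np.sum(10.0 + x ** 2 - 10.0 * np.cos(2.0 * np.pi * x)))

# a local search from a poor starting point likely gets stuck
local = minimize(rastrigin, x0=[3.0], method="Nelder-Mead")

# a global search samples across the whole accepted range
result = differential_evolution(rastrigin, bounds=[(-5.12, 5.12)], seed=1)

print(local.fun)   # likely a local optimum: cost well above 0
print(result.fun)  # near the global optimum: cost near 0
```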

Candidate solutions are often very simple to describe and very easy to construct. The challenging part of function optimization is evaluating candidate solutions.

Solving a function optimization problem (or "solving" an objective function) refers to finding the optima. The whole goal of the project is to locate a specific candidate solution with a good or best cost, given the time and resources available. In simple and moderate problems, we may be able to locate the optimal candidate solution exactly and have some confidence that we have done so.

Many algorithms for nonlinear optimization problems seek only a local solution, a point at which the objective function is smaller than at all other feasible nearby points. They do not always find the global solution, which is the point with lowest function value among all feasible points. Global solutions are needed in some applications, but for many problems they are difficult to recognize and even more difficult to locate.

— Page 6, Numerical Optimization, 2006.

On more challenging problems, we may be happy with a relatively good candidate solution (e.g. good enough) given the time available for the project.

This section provides more resources on the topic if you are looking to go deeper.

- Algorithms for Optimization, 2019.
- Convex Optimization, 2004.
- Numerical Optimization, 2006.

In this tutorial, you discovered a gentle introduction to function optimization.

Specifically, you learned:

- The three elements of function optimization: candidate solutions, objective functions, and cost.
- The conceptualization of function optimization as navigating a search space and response surface.
- The difference between global optima and local optima when solving a function optimization problem.

**Do you have any questions?**

Ask your questions in the comments below and I will do my best to answer.

The post A Gentle Introduction to Function Optimization appeared first on Machine Learning Mastery.
