# What Exactly Can the Adam Optimizer Do?

Good results can be achieved in minutes, hours, or days depending on the optimization strategy you use for your deep learning model. The Adam optimizer is an optimization technique that has recently seen widespread adoption in deep learning applications such as computer vision and natural language processing.

If you’re interested in deep learning, this post offers a gentle introduction to the Adam optimization technique.

After reading this post, you will know:

- Why and how the Adam algorithm can help you fine-tune your models.
- The inner workings of the Adam algorithm and how it differs from AdaGrad and RMSProp, two related approaches.
- Commonly employed settings and methods for configuring the Adam algorithm.

Okay, so let’s get going.

**What Is the Adam Optimization Algorithm?**

When training a network, the Adam optimizer can be used as an alternative to classical stochastic gradient descent, providing iterative updates to the network’s weights based on the training data.

In their poster at the 2015 ICLR conference, Diederik Kingma (University of Amsterdam) and Jimmy Ba (University of Toronto) introduced Adam, a method for stochastic optimization. Unless otherwise noted, most of the material in this post is drawn from their paper.

When introducing the method for non-convex optimization problems, the developers of the Adam optimizer list its appealing features:

- Straightforward to implement.
- Computationally efficient.
- Low memory requirements.
- Invariant to diagonal rescaling of the gradients.
- Well suited to problems with many parameters and/or large amounts of data.
- Appropriate for non-stationary objectives.
- Appropriate for problems with sparse or very noisy gradients.
- Hyperparameters have intuitive interpretations and typically require little tuning.

**How Does Adam Work?**

Adam deviates from traditional stochastic gradient descent in important ways.

In stochastic gradient descent, a single learning rate (called alpha) is used for all weight updates, and it remains constant throughout training.

Adam, by contrast, maintains a learning rate for each network weight (parameter) and adapts it separately as training progresses.
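As a quick point of contrast, vanilla SGD applies one global learning rate to every parameter. A minimal sketch (variable names are illustrative, not from the paper):

```python
import numpy as np

def sgd_step(params, grads, alpha=0.01):
    # The same global learning rate alpha scales every parameter update.
    return [p - alpha * g for p, g in zip(params, grads)]

# Two parameter tensors, updated with the identical rate alpha.
params = [np.array([1.0, 2.0]), np.array([0.5])]
grads = [np.array([0.1, -0.2]), np.array([0.3])]
params = sgd_step(params, grads)
```

Adam replaces that single alpha with a per-parameter effective step size, as described next.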

Adam is described by the authors as bringing together the best features of two different stochastic gradient descent variants. Specifically:

**The variants are:**

- **Adaptive Gradient Algorithm (AdaGrad)**, which maintains a per-parameter learning rate and improves performance on problems with sparse gradients (e.g., natural language and computer vision problems).
- **Root Mean Square Propagation (RMSProp)**, which also maintains per-parameter learning rates, adapting them based on a moving average of recent gradient magnitudes (i.e., how quickly the gradient is changing). This makes the method well suited to online and non-stationary problems (e.g., noisy objectives).
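The two variants differ mainly in how they accumulate squared gradients. A minimal sketch of both update steps (function and variable names are mine, for illustration):

```python
import numpy as np

def adagrad_step(w, g, cache, alpha=0.01, eps=1e-8):
    # AdaGrad: accumulate the entire history of squared gradients,
    # so the effective learning rate only ever shrinks.
    cache = cache + g ** 2
    w = w - alpha * g / (np.sqrt(cache) + eps)
    return w, cache

def rmsprop_step(w, g, cache, alpha=0.01, beta=0.9, eps=1e-8):
    # RMSProp: exponentially decaying average of squared gradients,
    # so the effective learning rate can recover over time.
    cache = beta * cache + (1 - beta) * g ** 2
    w = w - alpha * g / (np.sqrt(cache) + eps)
    return w, cache

w1, c1 = adagrad_step(np.array([1.0]), np.array([0.5]), np.zeros(1))
w2, c2 = rmsprop_step(np.array([1.0]), np.array([0.5]), np.zeros(1))
```

The decaying average is what lets RMSProp (and, in turn, Adam) track non-stationary objectives instead of grinding to a halt.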

The Adam optimizer combines the benefits of both AdaGrad and RMSProp.

Adam adjusts the per-parameter learning rates using an average of the first moments of the gradients (the mean) as well as an average of the second moments (the uncentered variance).

The parameters beta1 and beta2 control the decay rates of the exponential moving averages of the gradient and the squared gradient, respectively.

Because the moving averages are initialized at zero, and beta1 and beta2 are close to 1.0, the moment estimates are biased toward zero early in training. Adam therefore computes the biased estimates first and then corrects them for this bias.
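The effect of the bias correction is easy to see numerically. In this sketch the gradient is held constant at 1.0, so a correct moment estimate should also be 1.0; the raw averages start far below that, and dividing by (1 - beta^t) recovers the right scale:

```python
beta1, beta2 = 0.9, 0.999
m, v = 0.0, 0.0  # moving averages start at zero, hence the bias
g = 1.0          # a constant gradient, for illustration

for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    # Bias-corrected estimates recover the true moment scale.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
```

At t = 1 the raw average m is only 0.1, while the corrected m_hat is exactly 1.0.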

**Effectiveness in Action: Adam**

The Adam optimizer has gained a lot of traction in the deep learning community due to its ability to produce high-quality results rapidly.

The original paper demonstrated convergence empirically, supporting the theoretical analysis. Adam was applied to a multilayer perceptron on the MNIST dataset, a convolutional neural network on the CIFAR-10 image recognition dataset, and logistic regression on MNIST and on IMDB sentiment analysis.

**Adam’s Intuition**

Intuitively, Adam does everything RMSProp does to fix AdaGrad’s decaying-denominator problem, and in addition keeps a cumulative (exponentially decaying) history of the gradients themselves, i.e., momentum.

Below is Adam’s update rule.
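The update rule from the original paper can be written as follows, where g_t is the gradient at step t, alpha the step size, and theta the parameters:

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1 - \beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2 \\
\hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t} \\
\theta_t &= \theta_{t-1} - \frac{\alpha\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{aligned}
```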

The update strategy of the Adam optimizer is very similar to that of RMSProp, which you may have seen if you read my earlier essay on optimizers. The main differences are the notation and the additional cumulative history of gradients (m_t).

Note the bias correction, which appears in the third step of the update rule.

**Python Code for Adam**

The Adam update can be implemented in Python as follows.

```python
import numpy as np

def adam():
    # Assumes data, grad_w, grad_b, and error are defined elsewhere.
    w, b, eta, max_epochs = 1, 1, 0.01, 100
    mw, mb, vw, vb = 0, 0, 0, 0
    eps, beta1, beta2 = 1e-8, 0.9, 0.99

    for i in range(max_epochs):
        dw, db = 0, 0
        for x, y in data:
            # Accumulate gradients over the data.
            dw += grad_w(w, b, x, y)
            db += grad_b(w, b, x, y)

        # Moving averages of the gradient (momentum) and squared gradient.
        mw = beta1 * mw + (1 - beta1) * dw
        mb = beta1 * mb + (1 - beta1) * db
        vw = beta2 * vw + (1 - beta2) * dw ** 2
        vb = beta2 * vb + (1 - beta2) * db ** 2

        # Bias correction for the zero-initialized averages.
        mw_hat = mw / (1 - beta1 ** (i + 1))
        mb_hat = mb / (1 - beta1 ** (i + 1))
        vw_hat = vw / (1 - beta2 ** (i + 1))
        vb_hat = vb / (1 - beta2 ** (i + 1))

        # Parameter update.
        w = w - eta * mw_hat / np.sqrt(vw_hat + eps)
        b = b - eta * mb_hat / np.sqrt(vb_hat + eps)

        print(error(w, b))
```

A detailed explanation of how the Adam optimizer functions is provided below.

**Adam’s Action Steps**

**The procedures consist of:**

a) Carry over the momentum and the accumulated squared gradient from the previous iteration.

b) Decay both of them (multiply the momentum and the squared-gradient accumulator by their decay rates).

c) Compute the gradient at the ball’s current location, as depicted in the illustration.

d) Add the gradient to the momentum, and the squared gradient to the accumulator (each scaled by 1 minus its decay rate).

e) Divide the momentum by the square root of the accumulator.

f) Take one step forward, and the cycle restarts, as seen in the diagram.
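The steps above can be sketched as a single Adam step (a minimal illustration with standard default hyperparameters; variable names are mine):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # (a, b, d) decay the running averages and fold in the new gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    # bias correction for the zero-initialized averages
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # (e, f) scale the momentum by the root of the squared-gradient
    # average, then step forward
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
theta, m, v = adam_step(theta, np.array([0.5]), m, v, t=1)
```

Each call returns the updated parameters plus the two running averages, which are carried into the next iteration, restarting the cycle.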

If you are interested in a real-time animation, I highly recommend checking out the aforementioned software; it will help you picture the process very clearly.

To clarify, Adam gets its speed from momentum and its ability to adapt to varying gradients from RMSProp. The combination makes it both more effective and faster than many alternative optimizers.

**Summary**

I hope this post helped you understand what the Adam optimizer is and how it operates, and gave you a clear picture of why Adam is so often the optimizer of choice compared with other algorithms that purport to do the same thing. In subsequent pieces, I’ll go into greater depth on a different class of optimizers. If you’re looking for additional information about data science, machine learning, AI, and cutting-edge technology, then you should check out the content we’ve put together over at InsideAIML.

Please accept my sincere gratitude for taking the time to read…