Updaters/Optimizers
Special algorithms for gradient descent.
The main difference among the updaters is how they treat the learning rate. Stochastic gradient descent, the most common learning algorithm in deep learning, relies on theta (the weights in the hidden layers) and alpha (the learning rate). Different updaters adapt the learning rate so that the neural network converges on its most performant state.

To use the updaters, pass a new class to the updater() method in either a ComputationGraph or MultiLayerNetwork.

```java
ComputationGraphConfiguration conf = new NeuralNetConfiguration.Builder()
    .updater(new Adam(0.01))
    // add your layers and hyperparameters below
    .build();
```
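The same pattern applies to a MultiLayerNetwork. A minimal sketch, assuming DL4J's Nesterovs updater class (constructor taking learning rate and momentum); the layer list is left as a placeholder:

```java
// Sketch: configuring an updater for a MultiLayerNetwork (assumes DL4J on the
// classpath; Nesterovs(learningRate, momentum) is from org.nd4j.linalg.learning.config).
MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
    .updater(new Nesterovs(0.01, 0.9)) // learning rate 0.01, momentum 0.9
    .list()
    // add your layers and hyperparameters below
    .build();
MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();
```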
applyUpdater
public void applyUpdater(INDArray gradient, int iteration, int epoch)
Calculate the update based on the given gradient. The update is applied in place to the gradient array.
- param gradient the gradient to get the update for
- param iteration the current iteration
- param epoch the current epoch
Nesterov’s momentum updater. Keeps track of the previous update (the momentum term) and uses it to adjust the current gradient step.
applyUpdater
public void applyUpdater(INDArray gradient, int iteration, int epoch)
Get the Nesterov update. The update is applied in place to the gradient array.
- param gradient the gradient to get the update for
- param iteration the current iteration
- param epoch the current epoch
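In plain Java (without ND4J), one common formulation of the Nesterov update can be sketched as follows; the class name and array-based API are illustrative, not DL4J's:

```java
// Sketch of Nesterov momentum on plain double arrays (illustrative, not DL4J).
// v holds the running momentum; with mu = 0 this reduces to plain SGD.
public class NesterovSketch {
    public static void step(double[] theta, double[] g, double[] v,
                            double lr, double mu) {
        for (int i = 0; i < theta.length; i++) {
            v[i] = mu * v[i] + g[i];             // accumulate momentum
            theta[i] -= lr * (g[i] + mu * v[i]); // look-ahead correction
        }
    }
}
```

With mu = 0 this is exactly vanilla SGD, which is a convenient sanity check.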
RMSProp updater: maintains a separate, adaptively scaled learning rate for each connection weight.
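The per-weight learning rate idea can be sketched in plain Java as follows (an illustrative formulation, not DL4J's implementation); avg holds a decaying average of squared gradients and rho is the decay rate:

```java
// Sketch of RMSProp on plain double arrays (illustrative, not DL4J).
// Each weight's step is scaled by the root of its own squared-gradient average.
public class RmsPropSketch {
    public static void step(double[] theta, double[] g, double[] avg,
                            double lr, double rho, double eps) {
        for (int i = 0; i < theta.length; i++) {
            avg[i] = rho * avg[i] + (1 - rho) * g[i] * g[i]; // decaying average
            theta[i] -= lr * g[i] / (Math.sqrt(avg[i]) + eps);
        }
    }
}
```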
applyUpdater
public void applyUpdater(INDArray gradient, int iteration, int epoch)
Gets feature-specific learning rates. AdaGrad keeps a history of the gradients passed in; each incoming gradient is adapted based on that history, hence the name AdaGrad.
- param gradient the gradient to get learning rates for
- param iteration the current iteration
- param epoch the current epoch
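AdaGrad's gradient history can be sketched the same way; hist accumulates squared gradients without decay, so frequently updated weights get ever-smaller effective learning rates (illustrative, not DL4J's implementation):

```java
// Sketch of AdaGrad on plain double arrays (illustrative, not DL4J).
// hist never decays, so the effective learning rate only shrinks over time.
public class AdaGradSketch {
    public static void step(double[] theta, double[] g, double[] hist,
                            double lr, double eps) {
        for (int i = 0; i < theta.length; i++) {
            hist[i] += g[i] * g[i]; // accumulate squared gradients
            theta[i] -= lr * g[i] / (Math.sqrt(hist[i]) + eps);
        }
    }
}
```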
applyUpdater
public void applyUpdater(INDArray gradient, int iteration, int epoch)
Calculate the update based on the given gradient. The update is applied in place to the gradient array.
- param gradient the gradient to get the update for
- param iteration the current iteration
- param epoch the current epoch
NoOp updater: a gradient updater that makes no changes to the gradient.
applyUpdater
public void applyUpdater(INDArray gradient, int iteration, int epoch)
Calculate the update based on the given gradient. The update is applied in place to the gradient array.
- param gradient the gradient to get the update for
- param iteration the current iteration
- param epoch the current epoch
AdaDelta updater. A more robust AdaGrad that keeps a moving-window average of squared gradients rather than AdaGrad's ever-decaying learning rates.
applyUpdater
public void applyUpdater(INDArray gradient, int iteration, int epoch)
Get the updated gradient for the given gradient, and also update the internal AdaDelta state. The update is applied in place to the gradient array.
- param gradient the gradient to get the updated gradient for
- param iteration the current iteration
- param epoch the current epoch
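The moving-window idea can be sketched as follows (illustrative, not DL4J's implementation). Note that AdaDelta has no explicit learning rate: the step size is the ratio of two running RMS averages:

```java
// Sketch of AdaDelta on plain double arrays (illustrative, not DL4J).
// avgSqGrad:  decaying average of squared gradients.
// avgSqDelta: decaying average of squared parameter updates.
public class AdaDeltaSketch {
    public static void step(double[] theta, double[] g,
                            double[] avgSqGrad, double[] avgSqDelta,
                            double rho, double eps) {
        for (int i = 0; i < theta.length; i++) {
            avgSqGrad[i] = rho * avgSqGrad[i] + (1 - rho) * g[i] * g[i];
            double delta = -Math.sqrt(avgSqDelta[i] + eps)
                         / Math.sqrt(avgSqGrad[i] + eps) * g[i];
            avgSqDelta[i] = rho * avgSqDelta[i] + (1 - rho) * delta * delta;
            theta[i] += delta; // no global learning rate anywhere
        }
    }
}
```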
SGD updater: applies only a learning rate to the gradient.
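For contrast with the stateful updaters above, a plain-SGD step keeps no history at all (illustrative sketch):

```java
// Sketch of plain SGD: no state, just theta -= lr * gradient.
public class SgdSketch {
    public static void step(double[] theta, double[] g, double lr) {
        for (int i = 0; i < theta.length; i++) {
            theta[i] -= lr * g[i];
        }
    }
}
```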
Gradient modifications: calculates an update and tracks the state needed to handle gradient changes over time when applying updates.
The AMSGrad updater
Reference: On the Convergence of Adam and Beyond - https://openreview.net/forum?id=ryQu7f-RZ
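AMSGrad modifies Adam by taking the elementwise maximum of the second-moment estimate over time, so the effective per-weight learning rate never increases. A plain-Java sketch, without Adam's bias correction (illustrative, not DL4J's implementation):

```java
// Sketch of AMSGrad on plain double arrays (illustrative, not DL4J).
// m: first moment, v: second moment, vHat: running max of v (the AMSGrad twist).
public class AmsGradSketch {
    public static void step(double[] theta, double[] g,
                            double[] m, double[] v, double[] vHat,
                            double lr, double b1, double b2, double eps) {
        for (int i = 0; i < theta.length; i++) {
            m[i] = b1 * m[i] + (1 - b1) * g[i];
            v[i] = b2 * v[i] + (1 - b2) * g[i] * g[i];
            vHat[i] = Math.max(vHat[i], v[i]); // never let the denominator shrink
            theta[i] -= lr * m[i] / (Math.sqrt(vHat[i]) + eps);
        }
    }
}
```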