Updaters/Optimizers
Special algorithms for gradient descent.
What are updaters?
The main difference among the updaters is how they treat the learning rate. Stochastic gradient descent (SGD), the most common learning algorithm in deep learning, adjusts theta (the network's weights) by stepping along the gradient, scaled by alpha (the learning rate). The other updaters adapt this step size over the course of training to help the network converge on its most performant state.
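In its plainest form, the step SGD applies can be sketched in a few lines of Java (class and method names here are illustrative, not DL4J API):

```java
// Plain SGD: theta <- theta - alpha * gradient, applied element-wise.
// Sketch only; the names SgdSketch and step are hypothetical.
public class SgdSketch {

    /** Applies one SGD step in place to a flat parameter array. */
    public static void step(double[] theta, double[] grad, double alpha) {
        for (int i = 0; i < theta.length; i++) {
            theta[i] -= alpha * grad[i];
        }
    }
}
```

Every updater below is a variation on this step, differing mainly in how the effective alpha is computed per parameter and per iteration.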
Usage
To use an updater, pass a new updater instance to the updater() method of the network configuration builder when constructing a ComputationGraph or MultiLayerNetwork.
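A minimal configuration sketch, assuming a recent DL4J release (the class name UpdaterConfigExample, layer sizes, and learning rate are illustrative):

```java
import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.OutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

public class UpdaterConfigExample {
    public static void main(String[] args) {
        // Select the Adam updater for all layers via the configuration builder;
        // any of the updaters below (Nesterovs, RmsProp, AdaGrad, ...) can be
        // swapped in here.
        MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
                .updater(new Adam(1e-3))   // learning rate is illustrative
                .list()
                .layer(new OutputLayer.Builder(LossFunctions.LossFunction.MCXENT)
                        .activation(Activation.SOFTMAX)
                        .nIn(784).nOut(10).build())
                .build();

        MultiLayerNetwork net = new MultiLayerNetwork(conf);
        net.init();
    }
}
```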
Available updaters
NadamUpdater
The Nadam updater. https://arxiv.org/pdf/1609.04747.pdf
applyUpdater
Calculate the update based on the given gradient
param gradient the gradient to get the update for
param iteration
return the gradient
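Nadam combines Adam's adaptive moment estimates with Nesterov-style momentum. A plain-Java sketch of the textbook rule (hypothetical names, not DL4J's internals):

```java
// Nadam: Adam's first and second moments plus a Nesterov-style lookahead
// blend of the corrected moment and the raw gradient.
// Sketch only; m and v must be zero-initialized and persist across calls,
// and t is the 1-based iteration count.
public class NadamSketch {

    public static void step(double[] theta, double[] grad, double[] m, double[] v,
                            double alpha, double b1, double b2, double eps, int t) {
        for (int i = 0; i < theta.length; i++) {
            m[i] = b1 * m[i] + (1 - b1) * grad[i];            // first moment
            v[i] = b2 * v[i] + (1 - b2) * grad[i] * grad[i];  // second moment
            double mHat = m[i] / (1 - Math.pow(b1, t));       // bias correction
            double vHat = v[i] / (1 - Math.pow(b2, t));
            // Nesterov-style blend of corrected moment and current gradient
            double blend = b1 * mHat
                    + (1 - b1) * grad[i] / (1 - Math.pow(b1, t));
            theta[i] -= alpha * blend / (Math.sqrt(vHat) + eps);
        }
    }
}
```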
NesterovsUpdater
Nesterov’s momentum. Keeps track of the previous update step (the velocity) and uses it to modify the current gradient update.
applyUpdater
Get the Nesterov update
param gradient the gradient to get the update for
param iteration
return
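The momentum bookkeeping can be sketched as follows, using the common formulation from the cs231n course notes (hypothetical names, not DL4J's internals):

```java
// Nesterov momentum: v <- mu * v - alpha * g, with a lookahead correction
// applied to the parameters. Sketch only; the velocity array v must be
// zero-initialized and persist across calls.
public class NesterovSketch {

    public static void step(double[] theta, double[] grad, double[] v,
                            double alpha, double mu) {
        for (int i = 0; i < theta.length; i++) {
            double vPrev = v[i];
            v[i] = mu * v[i] - alpha * grad[i];
            theta[i] += -mu * vPrev + (1 + mu) * v[i];  // lookahead update
        }
    }
}
```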
RmsPropUpdater
RMS Prop updates:
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf http://cs231n.github.io/neural-networks-3/#ada
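The rule from those references can be sketched as follows (hypothetical names, not DL4J's internals; the decay rate is often written as rho):

```java
// RMSProp: divide each step by a decaying average of recent squared gradients.
// Sketch only; the cache array must be zero-initialized and persist across calls.
public class RmsPropSketch {

    public static void step(double[] theta, double[] grad, double[] cache,
                            double alpha, double decay, double eps) {
        for (int i = 0; i < theta.length; i++) {
            cache[i] = decay * cache[i] + (1 - decay) * grad[i] * grad[i];
            theta[i] -= alpha * grad[i] / (Math.sqrt(cache[i]) + eps);
        }
    }
}
```

Unlike AdaGrad below, the moving average lets old gradients fade out, so the effective learning rate does not decay forever.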
AdaGradUpdater
Vectorized Learning Rate used per Connection Weight
Adapted from: http://xcorr.net/2014/01/23/adagrad-eliminating-learning-rates-in-stochastic-gradient-descent See also http://cs231n.github.io/neural-networks-3/#ada
applyUpdater
Gets feature-specific learning rates. AdaGrad keeps a history of the gradients passed in; note that each gradient becomes adapted over time, hence the opName adagrad.
param gradient the gradient to get learning rates for
param iteration
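The per-weight adaptation can be sketched as follows (hypothetical names; DL4J's actual implementation operates on INDArrays):

```java
// AdaGrad: accumulate all squared gradients ever seen, and scale each
// weight's step by the inverse square root of that history.
// Sketch only; the hist array must be zero-initialized and persist across calls.
public class AdaGradSketch {

    public static void step(double[] theta, double[] grad, double[] hist,
                            double alpha, double eps) {
        for (int i = 0; i < theta.length; i++) {
            hist[i] += grad[i] * grad[i];  // history only ever grows...
            theta[i] -= alpha * grad[i]
                    / (Math.sqrt(hist[i]) + eps);  // ...so steps keep shrinking
        }
    }
}
```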
AdaMaxUpdater
The AdaMax updater, a variant of Adam. http://arxiv.org/abs/1412.6980
applyUpdater
Calculate the update based on the given gradient
param gradient the gradient to get the update for
param iteration
return the gradient
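AdaMax replaces Adam's second moment with a running infinity-norm maximum. A textbook sketch of the rule from the paper above (hypothetical names, not DL4J's internals):

```java
// AdaMax: first moment as in Adam, but the denominator is a running max of
// gradient magnitudes (the infinity norm) rather than an RMS average.
// Sketch only; m and u must be zero-initialized and persist across calls,
// and t is the 1-based iteration count.
public class AdaMaxSketch {

    public static void step(double[] theta, double[] grad, double[] m, double[] u,
                            double alpha, double b1, double b2, int t) {
        for (int i = 0; i < theta.length; i++) {
            m[i] = b1 * m[i] + (1 - b1) * grad[i];
            u[i] = Math.max(b2 * u[i], Math.abs(grad[i]));
            theta[i] -= (alpha / (1 - Math.pow(b1, t))) * m[i] / u[i];
        }
    }
}
```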
NoOpUpdater
NoOp updater: gradient updater that makes no changes to the gradient
AdamUpdater
The Adam updater. http://arxiv.org/abs/1412.6980
applyUpdater
Calculate the update based on the given gradient
param gradient the gradient to get the update for
param iteration
return the gradient
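Adam's two moment estimates and their bias correction can be sketched as follows (hypothetical names, following the paper above, not DL4J's internals):

```java
// Adam: decaying averages of the gradient (m) and squared gradient (v),
// each bias-corrected for the zero initialization, set the step per weight.
// Sketch only; m and v must be zero-initialized and persist across calls,
// and t is the 1-based iteration count.
public class AdamSketch {

    public static void step(double[] theta, double[] grad, double[] m, double[] v,
                            double alpha, double b1, double b2, double eps, int t) {
        for (int i = 0; i < theta.length; i++) {
            m[i] = b1 * m[i] + (1 - b1) * grad[i];            // first moment
            v[i] = b2 * v[i] + (1 - b2) * grad[i] * grad[i];  // second moment
            double mHat = m[i] / (1 - Math.pow(b1, t));       // bias correction
            double vHat = v[i] / (1 - Math.pow(b2, t));
            theta[i] -= alpha * mHat / (Math.sqrt(vHat) + eps);
        }
    }
}
```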
AdaDeltaUpdater
http://www.matthewzeiler.com/pubs/googleTR2012/googleTR2012.pdf https://arxiv.org/pdf/1212.5701v1.pdf
AdaDelta updater. A more robust AdaGrad that keeps track of a moving-window average of the squared gradients rather than AdaGrad's ever-decaying learning rates.
applyUpdater
Get the updated gradient for the given gradient and also update the internal state of AdaDelta.
param gradient the gradient to get the updated gradient for
param iteration
return the updated gradient
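The update from the papers above can be sketched as follows (hypothetical names, not DL4J's internals; eps keeps the ratio finite before any history accumulates):

```java
// AdaDelta: scale each step by the ratio of two decaying RMS averages, one
// over past updates (accDelta) and one over past squared gradients (accGrad).
// Note that no explicit learning rate appears. Sketch only; both state
// arrays must be zero-initialized and persist across calls.
public class AdaDeltaSketch {

    public static void step(double[] theta, double[] grad, double[] accGrad,
                            double[] accDelta, double rho, double eps) {
        for (int i = 0; i < theta.length; i++) {
            accGrad[i] = rho * accGrad[i] + (1 - rho) * grad[i] * grad[i];
            double delta = -(Math.sqrt(accDelta[i] + eps)
                    / Math.sqrt(accGrad[i] + eps)) * grad[i];
            accDelta[i] = rho * accDelta[i] + (1 - rho) * delta * delta;
            theta[i] += delta;
        }
    }
}
```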
SgdUpdater
SGD updater applies a learning rate only
GradientUpdater
The common interface implemented by all updaters: calculates an update and tracks related state about gradient changes over time for handling updates.
AMSGradUpdater
The AMSGrad updater. Reference: On the Convergence of Adam and Beyond, https://openreview.net/forum?id=ryQu7f-RZ
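AMSGrad's fix to Adam is a running maximum over the second-moment estimates. A sketch of the paper's rule, without Adam's bias correction (hypothetical names, not DL4J's internals):

```java
// AMSGrad: like Adam, but divide by the max of all second-moment estimates
// so the effective per-weight learning rate can never increase.
// Sketch only; m, v, and vMax must be zero-initialized and persist across calls.
public class AmsGradSketch {

    public static void step(double[] theta, double[] grad,
                            double[] m, double[] v, double[] vMax,
                            double alpha, double b1, double b2, double eps) {
        for (int i = 0; i < theta.length; i++) {
            m[i] = b1 * m[i] + (1 - b1) * grad[i];
            v[i] = b2 * v[i] + (1 - b2) * grad[i] * grad[i];
            vMax[i] = Math.max(vMax[i], v[i]);  // non-increasing step sizes
            theta[i] -= alpha * m[i] / (Math.sqrt(vMax[i]) + eps);
        }
    }
}
```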