Optimizers such as AdaGrad and RMSProp are adaptive: rather than changing the learning rate according to a single pre-defined schedule after every epoch, they adjust the learning rate for each parameter based on the gradients observed during training. Because different dimensions of the parameter space can have very different gradient magnitudes, adapting the learning rate per parameter can lead to better convergence.
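As a rough illustration of why per-parameter scaling matters, the NumPy sketch below (toy numbers, not any particular optimizer) compares a single global learning rate with a per-dimension one when gradient magnitudes differ by orders of magnitude.

```python
import numpy as np

# Toy example: three parameters whose gradients live on very different scales,
# so no single global learning rate suits all of them at once.
grads = np.array([100.0, 1.0, 0.01])

# Global (scheduled) learning rate: the same scalar for every parameter.
global_lr = 0.01
print(global_lr * grads)            # [1.0, 0.01, 0.0001] -- wildly uneven steps

# Per-parameter learning rate: scale each dimension by its own factor, here
# simply the inverse gradient magnitude; adaptive optimizers estimate this
# factor from the running gradient history instead of knowing it up front.
per_param_lr = global_lr / (np.abs(grads) + 1e-8)
print(per_param_lr * grads)         # roughly [0.01, 0.01, 0.01] -- balanced steps
```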
Because AdaGrad adapts the learning rate based on the accumulated gradients, it can converge quickly at first, but it suffers from a scaling problem: the accumulated sum of squared gradients only grows, so the effective learning rate keeps shrinking, which slows learning and can lead to a sub-optimal solution.
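A minimal AdaGrad-style update, written here as a standalone NumPy sketch (function and variable names are my own, not a library API), makes the scaling issue visible: the accumulator only grows, so the effective step size keeps shrinking.

```python
import numpy as np

def adagrad_step(param, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad-style update: divide the learning rate by the square
    root of the accumulated sum of squared gradients."""
    accum = accum + grad ** 2                        # running sum, never shrinks
    param = param - lr * grad / (np.sqrt(accum) + eps)
    return param, accum

param = np.array([1.0, 1.0])
accum = np.zeros_like(param)
for _ in range(1000):
    grad = np.array([0.5, 0.5])                      # constant toy gradient
    param, accum = adagrad_step(param, grad, accum)

# Because `accum` never shrinks, lr / sqrt(accum) keeps decaying toward
# zero as training goes on, which is what can stall AdaGrad late in training.
print(0.01 / np.sqrt(accum))                         # roughly [0.00063, 0.00063]
```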
RMSProp is similar to AdaGrad, but it scales the learning rate using an exponentially decaying average of the squared gradients instead of the full accumulated sum, so the effective learning rate does not decay to zero.
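The same sketch can be adapted to RMSProp by swapping the ever-growing sum for an exponentially decaying average; the names and hyperparameter defaults below are illustrative, not tied to a specific library.

```python
import numpy as np

def rmsprop_step(param, grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp-style update: an exponentially decaying average of the
    squared gradients replaces AdaGrad's ever-growing sum."""
    avg_sq = decay * avg_sq + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(avg_sq) + eps)
    return param, avg_sq

param = np.array([1.0, 1.0])
avg_sq = np.zeros_like(param)
for _ in range(1000):
    grad = np.array([0.5, 0.5])                      # same constant toy gradient
    param, avg_sq = rmsprop_step(param, grad, avg_sq)

# avg_sq settles near grad**2 instead of growing without bound, so the
# effective learning rate stops shrinking rather than decaying to zero.
print(0.001 / np.sqrt(avg_sq))                       # stays close to 0.002
```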
The Adam optimizer often converges faster than plain stochastic gradient descent, but it does not always reach a better solution; well-tuned SGD, especially with momentum, can match or exceed Adam's final result on some problems.
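For comparison, here is a minimal Adam-style update in the same style: it keeps decaying averages of both the gradient (first moment) and the squared gradient (second moment) and applies bias correction; again, the names and defaults are illustrative rather than a specific framework's API.

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update: decaying averages of the gradient (m) and the
    squared gradient (v), with bias correction for the early steps."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)                   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)                   # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

param = np.array([1.0, 1.0])
m = np.zeros_like(param)
v = np.zeros_like(param)
for t in range(1, 1001):                             # t starts at 1 for bias correction
    grad = np.array([0.5, 0.5])
    param, m, v = adam_step(param, grad, m, v, t)

# Adam usually makes fast early progress, but which optimizer ends up at the
# better solution still depends on the problem and on tuning.
print(param)
```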