SVRG, which stands for Stochastic Variance Reduced Gradient, is an optimization technique first introduced in the 2013 paper *Accelerating Stochastic Gradient Descent using Predictive Variance Reduction*. It complements SGD (Stochastic Gradient Descent), which is widely used for large-scale optimization but suffers from slow asymptotic convergence due to its inherent variance: SGD approximates the full gradient using a small batch of data or a single data sample, which introduces variance and therefore requires starting with a small learning rate to ensure convergence. SVRG remedies this problem by keeping a snapshot of the estimated weights that is close to the optimal parameter values and maintaining the average of the full gradients over a full pass of the data. This average is calculated with respect to the snapshot weights from the last m-th epoch of training. SVRG then uses a different update rule: the gradient w.r.t. the current parameter values, minus the gradient w.r.t. the snapshot parameters from the last m-th epoch, plus the average of the full gradients over all data.
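In symbols, writing :math:`\tilde{w}` for the snapshot weights from the last m-th epoch and :math:`\tilde{\mu}` for the average of the full gradients evaluated at :math:`\tilde{w}`, the update rule above can be written as

.. math::

   \tilde{\mu} = \frac{1}{n} \sum_{i=1}^{n} \nabla f_i(\tilde{w}), \qquad
   w_{t+1} = w_t - \eta \left( \nabla f_{i_t}(w_t) - \nabla f_{i_t}(\tilde{w}) + \tilde{\mu} \right),

where :math:`f_i` is the loss on the i-th sample, :math:`i_t` is the sample drawn at step :math:`t`, and :math:`\eta` is the learning rate.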
Key Characteristics of SVRG:
SVRG optimization is implemented as a SVRGModule in mxnet.contrib.svrg_optimization, an extension of the existing mxnet.module.Module APIs that encapsulates the SVRG optimization logic within several new functions. The API changes of SVRGModule compared to Module are minimal for end users.
In distributed training, each worker gets the same snapshot weights from the last m-th epoch and calculates the full gradients with respect to its own shard of data. Standard SVRG optimization requires building a global full gradient, calculated by aggregating the full gradients from each worker and averaging over the number of workers. The workaround is to keep an additional set of keys in the KVStore that map to the full gradients.
The _SVRGOptimizer is designed to wrap two optimizers: an _AssignmentOptimizer, which is used for full-gradient accumulation in the KVStore, and a regular optimizer that performs the actual update rule on the parameters. The _SVRGOptimizer and _AssignmentOptimizer are designed to be used in SVRGModule only.
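The wrapping can be pictured with the sketch below. This is not the actual _SVRGOptimizer implementation; the class name, the key-prefix convention, and the _is_full_grad_key helper are hypothetical, and only the dispatch pattern is the point.

>>> import mxnet as mx
>>> class WrappedSVRGOptimizer(mx.optimizer.Optimizer):
...     """Illustrative only: route full-gradient keys to an assignment
...     rule and ordinary parameter keys to a regular optimizer."""
...     def __init__(self, default_optimizer='sgd', **kwargs):
...         super(WrappedSVRGOptimizer, self).__init__()
...         self.default_opt = mx.optimizer.create(default_optimizer, **kwargs)
...     def create_state(self, index, weight):
...         return self.default_opt.create_state(index, weight)
...     def update(self, index, weight, grad, state):
...         if self._is_full_grad_key(index):
...             weight[:] = grad  # assignment, not a gradient step
...         else:
...             self.default_opt.update(index, weight, grad, state)
...     @staticmethod
...     def _is_full_grad_key(index):
...         # hypothetical convention: full-gradient keys carry a prefix
...         return isinstance(index, str) and index.startswith('full_grad_')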
.. warning:: This package contains experimental APIs and may change in the near future.
This document lists the SVRGModule APIs in the MXNet contrib package:
.. autosummary::
   :nosignatures:

   mxnet.contrib.svrg_optimization.svrg_module
The only extra step in using a SVRGModule compared to using a Module is to check whether the current epoch should update the full gradients over all data. The code snippet below demonstrates the suggested usage of SVRGModule with the intermediate-level API.
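The snippet assumes a symbol model, a data iterator di, and num_epochs are already defined. A minimal toy setup might look like the following; the linear-regression symbol and the random data are illustrative assumptions, not part of the SVRGModule API.

>>> import mxnet as mx
>>> import numpy as np
>>> from mxnet.contrib.svrg_optimization.svrg_module import SVRGModule
>>> data = mx.sym.Variable('data')
>>> label = mx.sym.Variable('lin_reg_label')
>>> fc = mx.sym.FullyConnected(data=data, num_hidden=1)
>>> model = mx.sym.LinearRegressionOutput(data=fc, label=label, name='lin_reg')
>>> X = np.random.rand(100, 5)
>>> y = np.random.rand(100, 1)
>>> di = mx.io.NDArrayIter(X, y, batch_size=10, label_name='lin_reg_label')
>>> num_epochs = 100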
>>> mod = SVRGModule(symbol=model, update_freq=2, data_names=['data'], label_names=['lin_reg_label'])
>>> mod.bind(data_shapes=di.provide_data, label_shapes=di.provide_label)
>>> mod.init_params()
>>> mod.init_optimizer(optimizer='sgd', optimizer_params=(('learning_rate', 0.01), ), kvstore='local')
>>> for epoch in range(num_epochs):
...     if epoch % mod.update_freq == 0:
...         mod.update_full_grads(di)
...     di.reset()
...     for batch in di:
...         mod.forward_backward(data_batch=batch)
...         mod.update()
The high-level API usage of SVRGModule remains exactly the same as the Module API. The code snippet below gives an example of the suggested usage of the high-level API.
>>> mod = SVRGModule(symbol=model, update_freq=2, data_names=['data'], label_names=['lin_reg_label'])
>>> mod.fit(di, num_epochs=100, optimizer='sgd', optimizer_params=(('learning_rate', 0.01), ))
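Since SVRGModule extends Module, the usual Module helpers remain available after fitting. For example, evaluating with a built-in metric (reusing the toy iterator di sketched above) might look like:

>>> di.reset()
>>> score = mod.score(di, 'mse')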
.. automodule:: mxnet.contrib.svrg_optimization.svrg_module

.. autoclass:: mxnet.contrib.svrg_optimization.svrg_module.SVRGModule
   :members: init_optimizer, bind, forward, backward, reshape, update, update_full_grads, fit, prepare