Optimizer
class hetseq.optim.Adam(*args: Any, **kwargs: Any)[source]

Implements the Adam algorithm. This implementation is modified from torch.optim.Adam based on "Fixed Weight Decay Regularization in Adam" (see https://arxiv.org/abs/1711.05101). It has been proposed in "Adam: A Method for Stochastic Optimization."

Arguments:
- params (iterable): iterable of parameters to optimize or dicts defining parameter groups
- lr (float, optional): learning rate (default: 1e-3)
- betas (Tuple[float, float], optional): coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
- eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-8)
- weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
- amsgrad (boolean, optional): whether to use the AMSGrad variant of this algorithm from the paper "On the Convergence of Adam and Beyond"
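Since the argument list mirrors torch.optim.Adam, typical usage should follow the usual PyTorch optimizer pattern. The sketch below is a minimal example assuming the constructor accepts the documented keyword arguments the same way torch.optim.Adam does; it has not been verified against the hetseq source.

```python
import torch
from hetseq.optim import Adam

# Toy model; any torch.nn.Module exposes .parameters() the same way.
model = torch.nn.Linear(10, 2)

# Assumes the constructor accepts the documented keyword arguments,
# mirroring torch.optim.Adam (an assumption, not verified here).
optimizer = Adam(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    eps=1e-8,
    weight_decay=0.0,
    amsgrad=False,
)

# One optimization step on dummy data.
inputs = torch.randn(4, 10)
loss = model(inputs).sum()
loss.backward()        # compute gradients
optimizer.step()       # apply the Adam update
optimizer.zero_grad()  # clear gradients for the next step
```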
class hetseq.optim.Adadelta(*args: Any, **kwargs: Any)[source]

Implements the Adadelta algorithm. It has been proposed in "ADADELTA: An Adaptive Learning Rate Method."

Arguments:
- params (iterable): iterable of parameters to optimize or dicts defining parameter groups
- rho (float, optional): coefficient used for computing a running average of squared gradients (default: 0.9)
- eps (float, optional): term added to the denominator to improve numerical stability (default: 1e-6)
- lr (float, optional): coefficient that scales delta before it is applied to the parameters (default: 1.0)
- weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
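Usage follows the same pattern as the Adam class above. The sketch below assumes the constructor accepts the documented keyword arguments, mirroring torch.optim.Adadelta; the argument values shown are simply the documented defaults.

```python
import torch
from hetseq.optim import Adadelta

model = torch.nn.Linear(10, 2)

# Assumes the constructor accepts the documented keyword arguments,
# mirroring torch.optim.Adadelta (an assumption, not verified here).
optimizer = Adadelta(
    model.parameters(),
    lr=1.0,
    rho=0.9,
    eps=1e-6,
    weight_decay=0.0,
)

# One optimization step on dummy data.
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```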