Optimizer

class hetseq.optim.Adam(*args: Any, **kwargs: Any)

    Implements the Adam algorithm. This implementation is modified from
    torch.optim.Adam based on "Fixing Weight Decay Regularization in Adam"
    (see https://arxiv.org/abs/1711.05101). The algorithm itself was proposed
    in "Adam: A Method for Stochastic Optimization". A sketch of the decoupled
    weight-decay update step is shown after the argument list.

    Arguments:
        params (iterable): iterable of parameters to optimize or dicts
            defining parameter groups
        lr (float, optional): learning rate (default: 1e-3)
        betas (Tuple[float, float], optional): coefficients used for computing
            running averages of the gradient and its square
            (default: (0.9, 0.999))
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-8)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
        amsgrad (boolean, optional): whether to use the AMSGrad variant of this
            algorithm from the paper "On the Convergence of Adam and Beyond"
            (default: False)
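    The distinctive part of this optimizer is the decoupled weight decay from
    https://arxiv.org/abs/1711.05101: the decay shrinks the parameters directly
    rather than being folded into the gradient. Below is a minimal sketch of one
    such update step written against PyTorch tensors; the function and buffer
    names (adamw_step, exp_avg, exp_avg_sq) are illustrative placeholders, not
    HetSeq's internal code, and the AMSGrad variant is omitted for brevity.

        import torch

        def adamw_step(p, grad, exp_avg, exp_avg_sq, step,
                       lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0):
            """One Adam update with decoupled weight decay (illustrative sketch)."""
            beta1, beta2 = betas
            # Update biased first- and second-moment running averages.
            exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
            exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
            # Bias corrections for the moment estimates (step counts from 1).
            bias_correction1 = 1 - beta1 ** step
            bias_correction2 = 1 - beta2 ** step
            step_size = lr * (bias_correction2 ** 0.5) / bias_correction1
            # Decoupled weight decay: scale the parameters directly instead of
            # adding weight_decay * p to the gradient (classic L2 penalty).
            if weight_decay != 0:
                p.data.mul_(1 - lr * weight_decay)
            # p <- p - step_size * exp_avg / (sqrt(exp_avg_sq) + eps)
            p.data.addcdiv_(exp_avg, exp_avg_sq.sqrt().add_(eps), value=-step_size)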

class hetseq.optim.Adadelta(*args: Any, **kwargs: Any)

    Implements the Adadelta algorithm, proposed in "ADADELTA: An Adaptive
    Learning Rate Method". A sketch of one update step follows the argument
    list.

    Arguments:
        params (iterable): iterable of parameters to optimize or dicts
            defining parameter groups
        rho (float, optional): coefficient used for computing a running
            average of squared gradients (default: 0.9)
        eps (float, optional): term added to the denominator to improve
            numerical stability (default: 1e-6)
        lr (float, optional): coefficient that scales delta before it is
            applied to the parameters (default: 1.0)
        weight_decay (float, optional): weight decay (L2 penalty) (default: 0)
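    For reference, this is how the two running averages interact in one
    Adadelta step, sketched against PyTorch tensors. The names (adadelta_step,
    square_avg, acc_delta) are illustrative placeholders, not HetSeq's internal
    code; here the classic L2 weight decay is folded into the gradient, matching
    the weight_decay argument above.

        import torch

        def adadelta_step(p, grad, square_avg, acc_delta,
                          lr=1.0, rho=0.9, eps=1e-6, weight_decay=0.0):
            """One Adadelta update step (illustrative sketch)."""
            # Classic L2 penalty: add weight_decay * p to the gradient.
            if weight_decay != 0:
                grad = grad.add(p.data, alpha=weight_decay)
            # Running average of squared gradients.
            square_avg.mul_(rho).addcmul_(grad, grad, value=1 - rho)
            # Scale the gradient by the ratio of the two RMS terms.
            std = square_avg.add(eps).sqrt_()
            delta = acc_delta.add(eps).sqrt_().div_(std).mul_(grad)
            # Running average of squared parameter updates.
            acc_delta.mul_(rho).addcmul_(delta, delta, value=1 - rho)
            # lr scales delta before it is applied to the parameters.
            p.data.add_(delta, alpha=-lr)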