Parameters

Overview

To run HetSeq, almost all parameters are passed through the command line and processed by argparse. These parameters can be grouped into several clusters:

  • Extendable Components Parameters: including task, optimizer, and lr_scheduler;

  • Distributed Parameters: to set up distributed training environments;

  • Training Parameters: other important parameters that control stopping criteria, logging, checkpoints, etc.

Here we explain most of the parameters in detail.
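As a rough illustration of this layout (a minimal sketch, not HetSeq's actual parser, whose option names and defaults may differ), the clusters could be registered as argparse argument groups like this:

    import argparse

    # Hypothetical sketch of the parameter clusters; HetSeq's real parser
    # defines many more options and may organize them differently.
    parser = argparse.ArgumentParser(description="HetSeq training")

    components = parser.add_argument_group("extendable components")
    components.add_argument("--task", choices=["bert", "mnist"])
    components.add_argument("--optimizer", choices=["adam", "adadelta"])
    components.add_argument("--lr_scheduler", default="PolynomialDecayScheduler")

    distributed = parser.add_argument_group("distributed")
    distributed.add_argument("--distributed-world-size", type=int, default=1)
    distributed.add_argument("--distributed-init-method", default=None)
    distributed.add_argument("--device-id", type=int, default=0)

    training = parser.add_argument_group("training")
    training.add_argument("--max-epoch", type=int, default=0)
    training.add_argument("--max-update", type=int, default=0)
    training.add_argument("--seed", type=int, default=19940802)

    args = parser.parse_args()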

Extendable Components Parameters

  • Task: --task: Application name; its corresponding class defines the major parts of the application. Currently supports bert and mnist, and can be extended to other models.

    • --task bert:
      Extra parameters for bert task:
      • --data: Dataset directory or file to be loaded in the corresponding task.

      • --config_file: Configuration file of the BERT model; an example can be found here

      • --dict: Path to the BPE dictionary for the BERT model. Typically it has ~30,000 tokens.

      • --max_pred_length: max number of tokens in a sentence, 512 by default.

      • --num_file: number of input files used for training; useful together with --data for debugging. 0 by default, which uses all data files.

    • --task mnist:
      Extra parameters for mnist task:
      • --data: Dataset directory or file to be loaded in the corresponding task, compatible with torchvision.datasets.MNIST(path, train=True, download=True).

  • Optimizer: --optimizer: Optimizers in HetSeq are based on torch.optim.Optimizer with extra gradient and learning rate manipulation functions. Currently supports adam and adadelta, and can be extended to many other optimizers.

    • --optimizer adam:
      Extra parameters for the adam optimizer (Adam with fixed weight decay regularization):
      • --adam-betas: betas to control momentum and velocity. Default=’(0.9, 0.999)’.

      • --adam-eps: epsilon to avoid dividing by 0. Default=1e-8.

      • --weight-decay: weight decay. 0 by default.

    • --optimizer adadelta:
      Extra parameters for adadelta optimizer:
      • --adadelta_rho: coefficient used for computing a running average of squared gradients. Default=0.9.

      • --adadelta_eps: epsilon to avoid dividing by 0. Default=1e-6.

      • --adadelta_weight_decay: weight decay. 0 by default.

  • Lr_scheduler: --lr_scheduler: Learning rate scheduler in HetSeq, customized to take the stopping criteria end-learning-rate, total-num-update and warmup-updates into account. Currently supports the Polynomial Decay Scheduler (a sketch of the schedule follows this list).

    • --lr_scheduler PolynomialDecayScheduler:
      Extra parameters for PolynomialDecayScheduler:
      • --force-anneal: force annealing at the specified epoch; not set by default.

      • --power: decay power. 1.0 by default.

      • --warmup-updates: warmup the learning rate linearly for the first N updates, 0 by default.

      • --total-num-update: total number of update steps until learning rate decay to --end-learning-rate, 10000 by default.

      • --end-learning-rate: learning rate when training stops. 0 by default.
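For concreteness, here is a minimal sketch of the polynomial-decay-with-warmup rule that these flags describe (the function name and exact formula are illustrative; HetSeq's implementation may differ in detail):

    def polynomial_decay_lr(num_updates, base_lr, end_lr=0.0, power=1.0,
                            warmup_updates=0, total_num_update=10000):
        # Illustrative per-update learning rate under --warmup-updates,
        # --total-num-update, --end-learning-rate and --power.
        if warmup_updates > 0 and num_updates <= warmup_updates:
            # Warm up the learning rate linearly for the first N updates.
            return base_lr * num_updates / warmup_updates
        if num_updates >= total_num_update:
            # After --total-num-update steps the rate stays at --end-learning-rate.
            return end_lr
        # Polynomial decay from base_lr down to end_lr with exponent --power.
        remaining = 1.0 - (num_updates - warmup_updates) / (total_num_update - warmup_updates)
        return (base_lr - end_lr) * remaining ** power + end_lr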

Distributed Parameters

Distributed parameters play a key role in HetSeq: they set up the distributed training environment by defining the number of nodes, the number of GPUs, the communication method, etc.

  • --fast-stat-sync: Enable fast synchronization of stats between nodes; this hardcodes syncing only some default stats from logging_output.

  • --device-id: index of the single GPU used by the current process. 0 by default.

Parameters related to torch.nn.parallel.DistributedDataParallel; see its documentation for more information. Our implementation assumes input and output tensors are placed on the same device.

  • --bucket-cap-mb: 25 by default

  • --find-unused-parameters: False by default
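Assuming the parsed arguments are available as args (a placeholder name used here for illustration, not necessarily HetSeq's), these flags map onto the DistributedDataParallel constructor roughly as follows:

    import torch
    from torch.nn.parallel import DistributedDataParallel

    def wrap_model(model: torch.nn.Module, args) -> DistributedDataParallel:
        # Sketch only: assumes torch.distributed.init_process_group has already
        # been called (see the next group of parameters).
        device = torch.device("cuda", args.device_id)
        model = model.to(device)
        return DistributedDataParallel(
            model,
            device_ids=[args.device_id],          # input/output tensors stay on this GPU
            output_device=args.device_id,
            bucket_cap_mb=args.bucket_cap_mb,     # --bucket-cap-mb (25 by default)
            find_unused_parameters=args.find_unused_parameters,  # --find-unused-parameters
        )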

Parameters related to torch.distributed.init_process_group, which controls the main environment of distributed training. See Distributed Setting or the documentation for more information.

  • --ddp-backend: distributed data parallel backend; currently only c10d is supported, with NCCL used to communicate between GPUs. Default: ‘c10d’.

  • --distributed-init-method: method used to initialize communication between GPUs/processes (e.g. a tcp:// or shared-file URL). Default: None.

  • --distributed-world-size: total number of GPUs/processes in the distributed setting. Default: max(1, torch.cuda.device_count()).

  • --distributed-rank: rank of the current GPU, 0 by default.
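For reference, a minimal sketch of how these flags typically feed into process-group initialization (args again stands in for the parsed namespace; HetSeq performs this inside its own distributed utilities):

    import torch

    def init_distributed(args) -> None:
        # Bind this process to its GPU before creating the process group.
        torch.cuda.set_device(args.device_id)                # --device-id
        torch.distributed.init_process_group(
            backend="nccl",                                  # --ddp-backend c10d communicates via NCCL
            init_method=args.distributed_init_method,        # --distributed-init-method
            world_size=args.distributed_world_size,          # --distributed-world-size
            rank=args.distributed_rank,                      # --distributed-rank
        )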

Training Parameters

  • --max-epoch: maximum number of epochs allowed in training. 0 by default.

  • --max-update: maximum number of updates allowed in training. 0 by default.

  • --required-batch-size-multiple: require the batch size to be a multiple of this number. 1 by default.

  • --update-freq: update parameters every N_i batches, when in epoch i. 1 by default.

  • --max-tokens: maximum number of tokens in a batch; not assigned by default.

  • --max-sentences: maximum number of sentences/images/instances in a batch (the batch size); not assigned by default.

Note

Either --max-tokens or --max-sentences must be assigned in the parameter settings.
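The batching rule these flags describe can be sketched as follows (illustrative only; HetSeq's batch sampler is more involved and also honors --required-batch-size-multiple, which is omitted here):

    def make_batches(sample_lengths, max_tokens=None, max_sentences=None):
        # Group sample indices into batches, closing a batch once adding the
        # next sample would exceed --max-tokens or --max-sentences.
        batches, current, token_count = [], [], 0
        for idx, n_tokens in enumerate(sample_lengths):
            over_tokens = max_tokens is not None and token_count + n_tokens > max_tokens
            over_sentences = max_sentences is not None and len(current) >= max_sentences
            if current and (over_tokens or over_sentences):
                batches.append(current)
                current, token_count = [], 0
            current.append(idx)
            token_count += n_tokens
        if current:
            batches.append(current)
        return batches

    # With --max-tokens 8, samples of length [3, 4, 2, 5] form [[0, 1], [2, 3]].
    print(make_batches([3, 4, 2, 5], max_tokens=8))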

  • --train-subset: name of the training subset to load, train by default.

  • --num-workers: number of worker processes used for data loading.

  • --save-interval-updates: save a checkpoint (and validate) every N updates, 0 by default.

  • --seed: the only seed used in the training process, controlling all possible random steps (e.g. in torch, numpy and random). 19940802 by default.

  • --log-interval: log progress every N batches (when progress bar is disabled), 1 by default.

  • --log-format: log format to use, choices=[‘none’, ‘simple’], simple by default.
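Since --seed is intended to be the single source of randomness, a training script in this style would typically seed every library it touches at startup; a minimal sketch (the helper name is illustrative):

    import random

    import numpy as np
    import torch

    def set_seed(seed: int = 19940802) -> None:
        # Seed every RNG the training loop may rely on (--seed).
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)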