Parameters
Overview
To run HetSeq, almost all parameters are passed on the command line and processed by argparse. These parameters fall into three groups:
Extendable Components Parameters: including task, optimizer, and lr_scheduler;
Distributed Parameters: to set up the distributed training environment;
Training Parameters: other important parameters that control stop criteria, logging, checkpoints, etc.
The sections below explain most of these parameters in detail.
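As a rough illustration of how such a command line might be organized, the sketch below registers a few of the flags documented on this page in three argparse groups. The group names, the function name build_parser, and the choice to leave most defaults unset are assumptions made for this example, not HetSeq's actual parser code.

```python
import argparse


def build_parser():
    """Hypothetical parser that mirrors the three parameter groups."""
    parser = argparse.ArgumentParser(description="HetSeq-style options (sketch)")

    # Extendable components: task, optimizer and learning rate scheduler.
    components = parser.add_argument_group("Extendable components")
    components.add_argument("--task", choices=["bert", "mnist"])
    components.add_argument("--optimizer", choices=["adam", "adadelta"])
    components.add_argument("--lr_scheduler")

    # Distributed setup. The documented default for the world size is
    # max(1, torch.cuda.device_count()); 1 is used here to avoid importing torch.
    distributed = parser.add_argument_group("Distributed")
    distributed.add_argument("--distributed-world-size", type=int, default=1)
    distributed.add_argument("--distributed-rank", type=int, default=0)

    # Training control: stop criteria and the global seed.
    training = parser.add_argument_group("Training")
    training.add_argument("--max-epoch", type=int, default=0)
    training.add_argument("--max-update", type=int, default=0)
    training.add_argument("--seed", type=int, default=19940802)
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is repeatable.
args = build_parser().parse_args(["--task", "mnist", "--max-epoch", "5"])
print(args.task, args.max_epoch, args.seed)
```

argparse converts dashes in flag names to underscores, so --max-epoch is read back as args.max_epoch.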
Extendable Components Parameters
Task:
--task: Application name; its corresponding class defines the major parts of the application. Currently supports bert and mnist, and can be extended to other models.
Extra parameters for --task bert:
--data: Dataset directory or file to be loaded in the corresponding task.
--config_file: Configuration file of the BERT model; an example can be found here.
--dict: Path of the BPE dictionary for the BERT model. Typically it has ~30,000 tokens.
--max_pred_length: Maximum number of tokens in a sentence, 512 by default.
--num_file: Number of input files for training, used with --data for debugging. 0 by default, which uses all the data files.
Extra parameters for --task mnist:
--data: Dataset directory or file to be loaded in the corresponding task, compatible with torchvision.datasets.MNIST(path, train=True, download=True).
Optimizer:
--optimizer: Optimizers defined in HetSeq are based on torch.optim.Optimizer, with extra gradient and learning rate manipulation functions. Currently supports adam and adadelta, and can be extended to many other optimizers.
Extra parameters for --optimizer adam (Fixed Weight Decay Regularization in Adam):
--adam-betas: Betas to control momentum and velocity. Default: '(0.9, 0.999)'.
--adam-eps: Epsilon to avoid dividing by 0. Default: 1e-8.
--weight-decay: Weight decay, 0 by default.
Extra parameters for --optimizer adadelta:
--adadelta_rho: Default: 0.9.
--adadelta_eps: Epsilon to avoid dividing by 0. Default: 1e-6.
--adadelta_weight_decay: 0 by default.
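Because --adam-betas is given as a string such as '(0.9, 0.999)', it has to be converted to a pair of floats before it reaches the optimizer. The sketch below shows one way this could be done with the standard library. The helper names parse_adam_betas and adam_kwargs are hypothetical, and the resulting dictionary uses the keyword names of torch.optim.Adam (betas, eps, weight_decay), on the assumption that HetSeq's own Adam accepts the same ones.

```python
import ast


def parse_adam_betas(text="(0.9, 0.999)"):
    """Turn the --adam-betas string into a (beta1, beta2) tuple of floats."""
    betas = ast.literal_eval(text)  # safe evaluation of a Python literal
    if not isinstance(betas, (tuple, list)) or len(betas) != 2:
        raise ValueError("--adam-betas must contain exactly two values")
    beta1, beta2 = float(betas[0]), float(betas[1])
    if not (0.0 <= beta1 < 1.0 and 0.0 <= beta2 < 1.0):
        raise ValueError("each beta must lie in [0, 1)")
    return beta1, beta2


def adam_kwargs(adam_betas="(0.9, 0.999)", adam_eps=1e-8, weight_decay=0.0):
    """Collect the documented Adam options under torch.optim.Adam keyword names."""
    return {
        "betas": parse_adam_betas(adam_betas),
        "eps": adam_eps,
        "weight_decay": weight_decay,
    }


print(adam_kwargs())
```

With PyTorch installed, the dictionary could be passed along as torch.optim.Adam(model.parameters(), lr=..., **adam_kwargs()).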
Lr_scheduler:
--lr_scheduler: Learning rate schedulers defined in HetSeq are customized to take the stop criteria end-learning-rate, total-num-update, and warmup-updates into account. Currently supports the Polynomial Decay Scheduler.
Extra parameters for --lr_scheduler PolynomialDecayScheduler:
--force-anneal: Force annealing at the specified epoch; not set by default.
--power: Decay power, 1.0 by default.
--warmup-updates: Warm up the learning rate linearly for the first N updates, 0 by default.
--total-num-update: Total number of update steps until the learning rate decays to --end-learning-rate, 10000 by default.
--end-learning-rate: Learning rate when training stops, 0 by default.
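To make the interaction of these options concrete, here is a minimal sketch of a polynomial decay schedule with linear warmup. It follows a common formulation of this schedule and is assumed, not confirmed, to match what HetSeq computes. The function name polynomial_decay_lr and the base_lr argument are hypothetical, and --force-anneal is not modeled.

```python
def polynomial_decay_lr(num_updates, base_lr, warmup_updates=0,
                        total_num_update=10000, end_learning_rate=0.0,
                        power=1.0):
    """Learning rate after num_updates steps (sketch, assumed formulation)."""
    # Linear warmup from 0 to base_lr over the first warmup_updates steps.
    if warmup_updates > 0 and num_updates <= warmup_updates:
        return base_lr * num_updates / warmup_updates
    # Past the decay horizon the rate stays at its final value.
    if num_updates >= total_num_update:
        return end_learning_rate
    # Polynomial decay from base_lr down to end_learning_rate.
    remaining = 1.0 - (num_updates - warmup_updates) / (total_num_update - warmup_updates)
    return (base_lr - end_learning_rate) * remaining ** power + end_learning_rate


# Sample the schedule at a few steps for a short run.
for step in (0, 5, 10, 60, 110):
    lr = polynomial_decay_lr(step, base_lr=1.0, warmup_updates=10,
                             total_num_update=110)
    print(step, lr)
```

With power set to 1.0 the decay is linear; larger values make the rate drop faster right after warmup.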
Distributed Parameters
Distributed parameters play a key role in HetSeq: they set up the distributed training environment by defining the number of nodes, the number of GPUs, the communication method, etc.
--fast-stat-sync: Enable fast sync of stats between nodes; this hardcodes syncing to only some default stats from logging_output.
--device-id: Index of the single GPU used in training, 0 by default.
Parameters related to DistributedDataParallel (from torch.nn.parallel.distributed); see the PyTorch documentation for more information. Our implementation keeps input and output tensors on the same device.
--bucket-cap-mb: 25 by default.
--find-unused-parameters: False by default.
Parameters related to torch.distributed.init_process_group, which control the main environment of distributed training. See Distributed Setting or the PyTorch documentation for more information.
--ddp-backend: Distributed data parallel backend; currently only c10d with NCCL is supported for communication between GPUs. Default: 'c10d'.
--distributed-init-method: Initialization method for communication between GPUs. Default: None.
--distributed-world-size: Total number of GPUs/processes in the distributed setting. Default: max(1, torch.cuda.device_count()).
--distributed-rank: Rank of the current GPU, 0 by default.
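The sketch below illustrates how the last three options could map onto the keyword arguments of torch.distributed.init_process_group (backend, init_method, world_size, rank). The helper names, the validation rules, and the example address tcp://127.0.0.1:29500 are assumptions for illustration; HetSeq's own setup code may differ.

```python
def init_process_group_kwargs(distributed_init_method, distributed_world_size,
                              distributed_rank, backend="nccl"):
    """Map the documented flags onto init_process_group keyword arguments."""
    if distributed_init_method is None:
        raise ValueError("--distributed-init-method must be set for distributed runs")
    if not 0 <= distributed_rank < distributed_world_size:
        raise ValueError("--distributed-rank must lie in [0, world size)")
    return {
        "backend": backend,                      # NCCL for GPU communication
        "init_method": distributed_init_method,  # e.g. a tcp:// or file:// URL
        "world_size": distributed_world_size,    # total number of processes
        "rank": distributed_rank,                # global rank of this process
    }


def init_distributed(**flags):
    """Start the process group; requires PyTorch and reachable peer processes."""
    import torch.distributed as dist  # imported lazily so the helper above stays torch-free
    dist.init_process_group(**init_process_group_kwargs(**flags))


# Hypothetical values for process 1 of 4, rendezvousing over TCP.
kwargs = init_process_group_kwargs("tcp://127.0.0.1:29500", 4, 1)
print(kwargs)
```

Every process must use the same init method and world size, and each one needs a distinct rank.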
Training Parameters
--max-epoch: Maximum number of epochs allowed in training, 0 by default.
--max-update: Maximum number of updates allowed in training, 0 by default.
--required-batch-size-multiple: Require the batch size to be a multiple of the given number, 1 by default.
--update-freq: Update parameters every N_i batches when in epoch i, 1 by default.
--max-tokens: Maximum number of tokens in a batch; not assigned by default.
--max-sentences: Maximum number of sentences/images/instances in a batch (batch size); not assigned by default.
Note
Either --max-tokens or --max-sentences must be assigned in the parameter settings.
--train-subset: Name of the training subset, train by default.
--num-workers: Number of threads used in the data loading process.
--save-interval-updates: Save a checkpoint (and validate) every N updates, 0 by default.
--seed: The only seed in the training process, controlling all possible random steps (e.g. in torch, numpy and random). 19940802 by default.
--log-interval: Log progress every N batches (when the progress bar is disabled), 1 by default.
--log-format: Log format to use; choices are 'none' and 'simple', simple by default.
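The batch-size requirement from the note above and the role of --seed can be sketched as follows. The function names are hypothetical, and seeding numpy and torch only when they are importable is a choice made for this example, not HetSeq's documented behavior.

```python
import argparse
import random


def build_training_parser():
    """Hypothetical parser covering only the options used in this sketch."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-tokens", type=int, default=None)
    parser.add_argument("--max-sentences", type=int, default=None)
    parser.add_argument("--seed", type=int, default=19940802)
    return parser


def check_batch_options(args):
    """Enforce that at least one batch-size limit has been assigned."""
    if args.max_tokens is None and args.max_sentences is None:
        raise ValueError("--max-tokens or --max-sentences must be assigned")


def seed_everything(seed):
    """Seed random, plus numpy and torch when they are installed."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass


args = build_training_parser().parse_args(["--max-sentences", "32"])
check_batch_options(args)
seed_everything(args.seed)
```

Running the same check on an argument list that sets neither limit raises ValueError before any training work starts.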