Parameters¶
Overview¶
To run HetSeq, almost all of the parameters are passed through the command line and processed by argparse. Those parameters can be grouped into several clusters:
- Extendable Components Parameters: including task, optimizer, and lr_scheduler;
- Distributed Parameters: to set up the distributed training environment;
- Training Parameters: other important parameters that control stopping criteria, logging, checkpoints, etc.
Below we explain most parameters in detail.
Extendable Components Parameters¶
Task:
- --task: application name; its corresponding class defines the major parts of the application. Currently supports bert and mnist, and can be extended to other models.
- Extra parameters for the bert task (--task bert); an example command follows the list:
  - --data: dataset directory or file to be loaded in the corresponding task.
  - --config_file: configuration file of the BERT model; an example can be found here.
  - --dict: path of the BPE dictionary for the BERT model. Typically it has ~30,000 tokens.
  - --max_pred_length: maximum number of tokens in a sentence, 512 by default.
  - --num_file: number of input files for training, used together with --data to debug. 0 by default, which uses all the data files.
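For illustration, a single-GPU bert run might combine these flags as follows. This is only a sketch: the entry script name (train.py) and all paths are placeholders, not taken from this page.

    # hypothetical sketch: script name and paths are placeholders
    python3 train.py --task bert \
        --data /path/to/corpus/ \
        --dict /path/to/vocab.txt \
        --config_file /path/to/bert_config.json \
        --max_pred_length 512 \
        --num_file 0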
- Extra parameters for the mnist task (--task mnist); an example command follows the list:
  - --data: dataset directory or file to be loaded in the corresponding task, compatible with torchvision.datasets.MNIST(path, train=True, download=True).
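A corresponding mnist run only needs the data directory, since torchvision can download the dataset into it. Again a sketch, with an assumed entry script and a placeholder path.

    # hypothetical sketch: script name and path are placeholders
    python3 train.py --task mnist --data /path/to/mnist/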
Optimizer:
- --optimizer: optimizers defined in HetSeq are based on torch.optim.Optimizer with extra gradient and learning rate manipulation functions. Currently supports adam and adadelta, and can be extended to many other optimizers.
- Extra parameters for the adam optimizer (--optimizer adam), which implements fixed weight decay regularization in Adam; an example command follows the list:
  - --adam-betas: betas to control momentum and velocity. Default='(0.9, 0.999)'.
  - --adam-eps: epsilon to avoid dividing by 0. Default=1e-8.
  - --weight-decay: weight decay, 0 by default.
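As an illustration, the adam flags can be appended to any task command, here the minimal mnist one. The values shown are simply the defaults documented above; the script name and path remain placeholders.

    # hypothetical sketch: adam options spelled out with their documented defaults
    python3 train.py --task mnist --data /path/to/mnist/ \
        --optimizer adam \
        --adam-betas '(0.9, 0.999)' \
        --adam-eps 1e-8 \
        --weight-decay 0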
- Extra parameters for the adadelta optimizer (--optimizer adadelta); an example command follows the list:
  - --adadelta_rho: rho, the coefficient used for computing a running average of squared gradients. Default=0.9.
  - --adadelta_eps: epsilon to avoid dividing by 0. Default=1e-6.
  - --dadelta_weight_decay: weight decay, 0 by default.
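The same pattern with adadelta, again only a sketch using the documented defaults and placeholder paths:

    # hypothetical sketch: adadelta options with their documented defaults
    python3 train.py --task mnist --data /path/to/mnist/ \
        --optimizer adadelta \
        --adadelta_rho 0.9 \
        --adadelta_eps 1e-6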
Lr_scheduler:
- --lr_scheduler: learning rate scheduler defined in HetSeq, customized to take end-learning-rate, total-num-update, and warmup-updates into account as stopping criteria. Currently supports the Polynomial Decay Scheduler.
- Extra parameters for PolynomialDecayScheduler (--lr_scheduler PolynomialDecayScheduler); an example command follows the list:
  - --force-anneal: force annealing at the specified epoch; not set by default.
  - --power: decay power, 1.0 by default.
  - --warmup-updates: warm up the learning rate linearly for the first N updates, 0 by default.
  - --total-num-update: total number of update steps until the learning rate decays to --end-learning-rate, 10000 by default.
  - --end-learning-rate: learning rate when training stops, 0 by default.
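For example, a polynomial decay schedule with a short linear warmup could be requested as below. Only the flag names and defaults come from this page; the warmup value, script name, and paths are illustrative placeholders.

    # hypothetical sketch: warm up for 1000 updates, then decay to 0 over 10000 updates
    python3 train.py --task mnist --data /path/to/mnist/ \
        --optimizer adam \
        --lr_scheduler PolynomialDecayScheduler \
        --warmup-updates 1000 \
        --total-num-update 10000 \
        --power 1.0 \
        --end-learning-rate 0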
Distributed Parameters¶
Distributed parameters play a key role in HetSeq: they set up the distributed training environment, defining the number of nodes, the number of GPUs, the communication method, etc. A sketched multi-GPU command follows the parameter list.
- --fast-stat-sync: enable fast sync of stats between nodes; this hard-codes syncing only some default stats from logging_output.
- --device-id: index of the single GPU used in training, 0 by default.
Parameters related to torch.nn.parallel.DistributedDataParallel; see its documentation for more information. Our implementation keeps input and output tensors on the same device.
- --bucket-cap-mb: 25 by default.
- --find-unused-parameters: False by default.
Parameters related to torch.distributed.init_process_group, which control the main environment of distributed training. See Distributed Setting or the documentation for more information.
- --ddp-backend: distributed data parallel backend; currently only supports c10d with NCCL to communicate between GPUs. Default: 'c10d'.
- --distributed-init-method: initialization method to communicate between GPUs. Default: None.
- --distributed-world-size: total number of GPUs/processes in the distributed setting. Default: max(1, torch.cuda.device_count()).
- --distributed-rank: rank of the current GPU, 0 by default.
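As a concrete sketch, the first process of a two-node, eight-GPU job might set these flags as below. The address, port, script name, and paths are placeholders, and each of the other seven processes would use its own --distributed-rank and --device-id.

    # hypothetical sketch: rank-0 process of a 2-node x 4-GPU job (address and port are placeholders)
    python3 train.py --task mnist --data /path/to/mnist/ \
        --ddp-backend c10d \
        --distributed-init-method tcp://10.0.0.1:12345 \
        --distributed-world-size 8 \
        --distributed-rank 0 \
        --device-id 0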
Training Parameters¶
- --max-epoch: maximum number of epochs allowed in training, 0 by default.
- --max-update: maximum number of updates allowed in training, 0 by default.
- --required-batch-size-multiple: check that the batch size is a multiple of the given number, 1 by default.
- --update-freq: update parameters every N_i batches when in epoch i, 1 by default.
- --max-tokens: maximum number of tokens in a batch; not assigned by default.
- --max-sentences: maximum number of sentences/images/instances in a batch (the batch size); not assigned by default.
Note
Either --max-tokens or --max-sentences must be assigned in the parameter settings.
- --train-subset: string to store the training subset, train by default.
- --num-workers: number of threads used in the data loading process.
- --save-interval-updates: save a checkpoint (and validate) every N updates, 0 by default.
- --seed: the only seed in the training process, controlling all possible random steps (e.g. in torch, numpy, and random), 19940802 by default.
- --log-interval: log progress every N batches (when the progress bar is disabled), 1 by default.
- --log-format: log format to use, choices=['none', 'simple'], simple by default.
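Putting several of these together, a run capped by update count rather than epochs might look like the sketch below. Remember that either --max-tokens or --max-sentences has to be given; all values besides the flag names are illustrative, and the script name and path are placeholders.

    # hypothetical sketch: stop after 100000 updates, 32 instances per batch,
    # accumulate gradients over 4 batches, checkpoint every 5000 updates
    python3 train.py --task mnist --data /path/to/mnist/ \
        --max-update 100000 \
        --max-sentences 32 \
        --update-freq 4 \
        --save-interval-updates 5000 \
        --num-workers 4 \
        --log-interval 100 \
        --log-format simple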