(A checklist for myself. At least human-readable 🤔)
¶Initialization
- DDP arguments and device (see the sketch below)
    - `torch.cuda.set_device(ddp_local_rank)`
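A minimal sketch of this setup, assuming the script is launched with `torchrun` (so `RANK`, `LOCAL_RANK`, and `WORLD_SIZE` are provided as environment variables) and the NCCL backend:

```python
import os

import torch
import torch.distributed as dist

# torchrun provides these environment variables for every process
ddp_rank = int(os.environ["RANK"])
ddp_local_rank = int(os.environ["LOCAL_RANK"])
ddp_world_size = int(os.environ["WORLD_SIZE"])

dist.init_process_group(backend="nccl")  # one process per GPU
torch.cuda.set_device(ddp_local_rank)    # bind this process to its own GPU
device = torch.device("cuda", ddp_local_rank)
master_process = ddp_rank == 0           # rank 0 handles logging / saving
```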
- output directory: make dir
- logger
    - file handler
        - if you want to log w/ a datetime (e.g., a datetime-named log file or output dir), remember to barrier so every process agrees on the same name (see the sketch below)
    - (optional) log the output directory
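A sketch of the datetime + barrier point above. Deciding the timestamp on rank 0 and broadcasting it is one way to keep the name consistent across ranks; the `outputs/` path and the per-rank log file are my own choices, not from the checklist:

```python
import logging
import os
import time

import torch.distributed as dist

# rank 0 decides the datetime-based name; broadcast it so every rank uses the same one
name = [time.strftime("%Y%m%d-%H%M%S")]
dist.broadcast_object_list(name, src=0)
out_dir = os.path.join("outputs", name[0])

if dist.get_rank() == 0:
    os.makedirs(out_dir, exist_ok=True)
dist.barrier()  # wait until the directory exists before attaching file handlers

logger = logging.getLogger("train")
logger.setLevel(logging.INFO)
logger.addHandler(logging.FileHandler(os.path.join(out_dir, f"rank{dist.get_rank()}.log")))
logger.info("output directory: %s", out_dir)  # (optional) log the output directory
```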
- save config files
- seed (see the sketch below)
    - each process gets a different seed (`seed + ddp_rank`)
        - NOTE: the seed should be the same during the data split (if there is one) and model initialization.
        - However, DDP forces the parameters to be the same across processes (it broadcasts them from rank 0 at construction), so we do not need to worry about model initialization: https://github.com/pytorch/pytorch/blob/1dba81f56dc33b44d7b0ecc92a039fe32ee80f8d/torch/nn/parallel/distributed.py#LL798C63-L798C63#
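A minimal sketch of the per-rank seeding; which RNGs to seed is my own assumption:

```python
import random

import numpy as np
import torch


def seed_everything(seed: int, ddp_rank: int) -> None:
    # each process gets a different seed so e.g. data augmentation differs per rank
    rank_seed = seed + ddp_rank
    random.seed(rank_seed)
    np.random.seed(rank_seed)
    torch.manual_seed(rank_seed)
    torch.cuda.manual_seed_all(rank_seed)
```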
- initialize the context for amp training (see the sketch below)
    - depending on the chosen dtype
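A sketch of the dtype-dependent context; falling back to `nullcontext` for float32 is an assumption on my part:

```python
from contextlib import nullcontext

import torch

dtype = "bfloat16"  # e.g. "float32", "bfloat16", or "float16"
ptdtype = {"float32": torch.float32, "bfloat16": torch.bfloat16, "float16": torch.float16}[dtype]
ctx = (
    nullcontext()
    if dtype == "float32"
    else torch.amp.autocast(device_type="cuda", dtype=ptdtype)
)
```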
- data loader (see the sketch after this block)
    - create dataset
        - If there are data downloading actions, remember to barrier the other processes; we only need the master process of each node (`ddp_local_rank == 0`) to download the dataset
        - split the dataset, if using Huggingface's `datasets`
            - The split sub-datasets may have different lengths. This may be problematic for an epoch-based trainer; for a step-based trainer it is fine.
    - create dataloader (by default, `DataLoader` uses `RandomSampler`)
        - The yield of the dataloader is a list of samples if `collate_fn=lambda x: x`
        - If you do not use Huggingface's `datasets`, you need to use `DistributedSampler` in `DataLoader`
            - the difference between `DistributedSampler` and `RandomSampler`:
                - `DistributedSampler` (https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler) generates a `torch.Generator` every time it is iterated, so it needs `set_epoch` to change the seed value each epoch.
                    - This is a compromise to the design that splits the data inside the sampler, so that the indices of all the data are shuffled globally.
                    - It needs a base seed that is the same for all processes.
                - `RandomSampler` (https://pytorch.org/docs/stable/_modules/torch/utils/data/sampler.html#RandomSampler) gets a generator when it is initialized, not a new one every time it is iterated.
        - If you incorporate numpy randomness in `Dataset`, set `worker_init_fn` to avoid identical randomness across dataloader workers
        - if you use `opencv` in `Dataset`, set its thread count to one, or some functions will occupy all the threads.
    - a helper function to move data from CPU to GPU, from HuggingFace's `transformers`
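A sketch tying the dataloader items together (DistributedSampler with a shared base seed plus `set_epoch`, an identity `collate_fn`, a numpy-seeding `worker_init_fn`, and the OpenCV thread limit). `ToyDataset`, the batch size, and the worker count are placeholders of mine:

```python
import cv2
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset, DistributedSampler

cv2.setNumThreads(1)  # keep opencv from occupying every CPU thread in each worker


class ToyDataset(Dataset):  # placeholder dataset
    def __len__(self):
        return 10_000

    def __getitem__(self, idx):
        return {"x": np.random.rand(3, 32, 32).astype(np.float32), "idx": idx}


def worker_init_fn(worker_id: int) -> None:
    # reuse torch's per-worker seed for numpy so workers do not share randomness
    np.random.seed((torch.initial_seed() + worker_id) % 2**32)


dataset = ToyDataset()
sampler = DistributedSampler(dataset, shuffle=True, seed=1337)  # same base seed on every rank
loader = DataLoader(
    dataset,
    batch_size=32,
    sampler=sampler,
    num_workers=4,
    worker_init_fn=worker_init_fn,
    collate_fn=lambda x: x,  # the dataloader then yields a plain list of samples
    pin_memory=True,
)

for epoch in range(10):
    sampler.set_epoch(epoch)  # reshuffle with a different derived seed each epoch
    for batch in loader:
        ...
```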
- model (see the sketch after this block)
    - initialization
        - from scratch
        - from resume
            - note that in DDP, `map_location` must be specified, or every process loads the tensors onto the checkpoint's original device, which may crash the VRAM; set `map_location="cpu"`
            - free the memory held by the loaded file: `ckpt = None` or `del ckpt`
    - move to device
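A sketch of the from-resume path; the checkpoint path, its `"model"` key, and the `build_model` / `resume` names are assumptions:

```python
import torch

device = torch.device("cuda", ddp_local_rank)
model = build_model()  # hypothetical constructor covering the "from scratch" case

if resume:
    # map to CPU so every rank avoids loading onto the checkpoint's original GPU
    ckpt = torch.load("ckpt.pt", map_location="cpu")
    model.load_state_dict(ckpt["model"])
    del ckpt  # free the memory held by the loaded file
model.to(device)
```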
- grad scaler for amp training if `dtype=float16`
- optimizer (see the sketch after this block)
    - set `param.requires_grad = {True, False}` to determine the trainable parameters
    - gather `param_groups = List[param_dict]`
        - get all parameters
        - filter: keep `requires_grad == True`
        - decay: `p.dim() >= 2`, i.e., any parameter that is at least 2D will be weight decayed
    - (optional) try "fused" AdamW: `"fused" in inspect.signature(torch.optim.AdamW).parameters`
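A sketch of the param-group gathering and the fused check; the weight decay, lr, and betas values are placeholders:

```python
import inspect

import torch

# keep only trainable parameters, then split by dimensionality for weight decay
param_dict = {n: p for n, p in model.named_parameters() if p.requires_grad}
decay_params = [p for p in param_dict.values() if p.dim() >= 2]
nodecay_params = [p for p in param_dict.values() if p.dim() < 2]
param_groups = [
    {"params": decay_params, "weight_decay": 0.1},
    {"params": nodecay_params, "weight_decay": 0.0},
]

# use the fused AdamW kernel when this torch build exposes it
fused_available = "fused" in inspect.signature(torch.optim.AdamW).parameters
extra_args = {"fused": True} if fused_available and torch.cuda.is_available() else {}
optimizer = torch.optim.AdamW(param_groups, lr=3e-4, betas=(0.9, 0.95), **extra_args)
```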
- (optional) compile model
- wrap the model into the DDP container (see the sketch below)
    - get `raw_model` to save the ckpt w/o the `"module."` prefix
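A sketch of the (optional) compile and the DDP wrap, keeping a `raw_model` handle so checkpoints are saved without the `"module."` prefix:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.compile(model)  # optional (PyTorch 2.x)
model = DDP(model, device_ids=[ddp_local_rank])
raw_model = model.module      # the unwrapped model, used when saving checkpoints
# later: torch.save(raw_model.state_dict(), "ckpt.pt")  -> no "module." prefix in the keys
```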
¶Training
(a note about global variables: if a variable is reassigned, not just mutated, inside a function, use `global` to declare it as global)
- `model.train()`
- adjust the lr based on the step; change the lr manually (see the sketch below)
    - `for param_group in optimizer.param_groups: param_group["lr"] = lr`
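A sketch of the manual lr update; only the `param_groups` loop comes from the checklist, the warmup + cosine schedule itself is an assumed example:

```python
import math


def get_lr(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
           warmup_steps: int = 2_000, max_steps: int = 100_000) -> float:
    # assumed schedule: linear warmup, then cosine decay down to min_lr
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step > max_steps:
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (1.0 + math.cos(math.pi * ratio)) * (max_lr - min_lr)


lr = get_lr(step)
for param_group in optimizer.param_groups:
    param_group["lr"] = lr
```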
- eval the model and save the weights (see the sketch below)
    - eval: distributed eval (check the Evaluation section below)?
    - log: only the master process logs?
    - save weights: only the master process saves
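A sketch of the eval/log/save gating; `eval_interval`, `evaluate`, `val_loader`, and the checkpoint contents are assumptions (the distributed eval itself is covered in the Evaluation section):

```python
import os

import torch

if step % eval_interval == 0:
    val_loss = evaluate(model, val_loader, device)  # distributed eval, every rank participates
    if master_process:
        logger.info("step %d | val loss %.4f", step, val_loss)
        torch.save(
            {"model": raw_model.state_dict(), "optimizer": optimizer.state_dict(), "step": step},
            os.path.join(out_dir, "ckpt.pt"),
        )
```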
- training step (see the sketch after this block)
    - `gradient_accumulation_steps`
    - in the model forward and loss computation, use the amp context
        - the loss is float32 because `mse_loss` layers `autocast` to float32; the output is float16 because convertible layers `autocast` to float16.
        - You don't need to manually change the inputs' `dtype` when enabling mixed precision.
    - if using `gradient_accumulation_steps`, the loss needs to be divided by `gradient_accumulation_steps`
    - scale the loss, then backward the gradients
    - clip the gradients: first `unscale_` the optimizer, then `clip_grad_norm_`
    - update the model w/ the scaler
        - to update the model: `scaler.step(optimizer)`
        - to update the scaler: `scaler.update()`
    - flush the gradients: `optimizer.zero_grad(set_to_none=True)`
    - log: training metrics and time for model and data
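A sketch putting the training-step items together (autocast context, gradient accumulation, GradScaler, unscale + clip, step/update, zero_grad); the batch unpacking, `data_iter`, and `loss_fn` are placeholders:

```python
import torch

scaler = torch.cuda.amp.GradScaler(enabled=(dtype == "float16"))

for micro_step in range(gradient_accumulation_steps):
    batch = next(data_iter)  # placeholder data fetching
    x, y = batch["x"].to(device), batch["y"].to(device)
    with ctx:  # the amp/autocast context from the Initialization section
        out = model(x)
        # divide so the accumulated gradient matches a full-batch average
        loss = loss_fn(out, y) / gradient_accumulation_steps
    scaler.scale(loss).backward()  # scale the loss, then backward

# clip: unscale_ first so clip_grad_norm_ sees the true gradient magnitudes
scaler.unscale_(optimizer)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

scaler.step(optimizer)  # update the model (skips the step if gradients overflowed)
scaler.update()         # update the scaler
optimizer.zero_grad(set_to_none=True)  # flush the gradients
```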
¶Evaluation: tips about distributed evaluation
- The results on each card might be different (if using amp/float16)
- `model.eval()`
- in the loss computation, we sum up all the loss per mini-batch (see the sketch below)
    - if one averages the loss within a batch, then the loss needs to be multiplied back by the divisor
        - the divisor may not be the same as the batch size,
            - e.g., the number of masks per sample, or the number of tokens per caption per sample
    - we also need to track and sum up the divisors
- reduce all the tracked and summed values with `dist.all_reduce`
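A sketch of the distributed evaluation above; the batch keys and the assumption that the model returns a loss averaged over `n_tokens` items are placeholders:

```python
import torch
import torch.distributed as dist


@torch.no_grad()
def evaluate(model, loader, device):
    """Sum per-batch losses and their divisors on every rank, then all_reduce once."""
    model.eval()
    total_loss = torch.zeros((), device=device)
    total_count = torch.zeros((), device=device)
    for batch in loader:
        x, y = batch["x"].to(device), batch["y"].to(device)
        n = float(batch["n_tokens"])     # the divisor used inside the loss (not always the batch size)
        loss = model(x, y)               # assumed to be averaged over n items
        total_loss += loss.detach() * n  # multiply the divisor back before summing
        total_count += n
    dist.all_reduce(total_loss, op=dist.ReduceOp.SUM)
    dist.all_reduce(total_count, op=dist.ReduceOp.SUM)
    model.train()
    return (total_loss / total_count).item()
```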