(A check list for myself. At least human-readable 🤔)
¶Initialization
- DDP arguments and device: `torch.cuda.set_device(ddp_local_rank)` (see the sketch below)
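A minimal sketch of the DDP setup this checklist assumes (launched with `torchrun`); `ddp_rank`, `ddp_local_rank`, `device`, and `master_process` are names I use throughout, not anything mandated by PyTorch.

```python
import os

import torch
import torch.distributed as dist

# torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for every process it spawns
dist.init_process_group(backend="nccl")
ddp_rank = int(os.environ["RANK"])              # global rank, unique across all nodes
ddp_local_rank = int(os.environ["LOCAL_RANK"])  # rank within this node -> GPU index
ddp_world_size = int(os.environ["WORLD_SIZE"])
device = f"cuda:{ddp_local_rank}"
torch.cuda.set_device(ddp_local_rank)           # make "cuda" default to this GPU
master_process = ddp_rank == 0                  # rank 0 handles logging / checkpointing
```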
- output directory: make dir
- logger
  - file handler
  - if you want to log w/ datetime (e.g., a timestamped output directory), remember to barrier so every process ends up with the same name (see the sketch below)
  - (optional) log the output directory
- save config files
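A sketch of the timestamped output directory + barrier point above. Broadcasting the name from rank 0 is my own choice here (a barrier alone still risks ranks computing different timestamps); `out_dir` and the path layout are assumptions.

```python
import os
import time

import torch.distributed as dist

# rank 0 decides the datetime-based name, everyone else receives it
name = [time.strftime("%Y%m%d-%H%M%S")] if ddp_rank == 0 else [None]
dist.broadcast_object_list(name, src=0)
out_dir = os.path.join("outputs", name[0])

if ddp_rank == 0:
    os.makedirs(out_dir, exist_ok=True)
dist.barrier()  # other ranks wait until the directory actually exists
```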
- seed (see the sketch below)
  - each process gets a different seed (`seed + ddp_rank`)
    - NOTE: the seed should be the same during the data split (if there is one) and model initialization.
    - However, DDP forces the parameters to be the same across processes (it broadcasts rank 0's weights at construction), so we do not need to worry about model initialization:
    - https://github.com/pytorch/pytorch/blob/1dba81f56dc33b44d7b0ecc92a039fe32ee80f8d/torch/nn/parallel/distributed.py#LL798C63-L798C63#
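A sketch of the seeding rule above; `base_seed` is an assumed config value. DDP's broadcast of rank 0's parameters is what makes the per-rank offset safe for the model.

```python
import random

import numpy as np
import torch

base_seed = 1337                 # assumed config value
seed = base_seed + ddp_rank      # different randomness stream per process
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)          # seeds CPU and all CUDA devices
```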
- initialize context for amp training (see the sketch below)
  - depending on the chosen dtype
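One way to build that context (a sketch; `dtype_str` is an assumed config field): a no-op context for float32, autocast otherwise.

```python
from contextlib import nullcontext

import torch

dtype_str = "bfloat16"  # assumed config field: "float32" | "bfloat16" | "float16"
ptdtype = {
    "float32": torch.float32,
    "bfloat16": torch.bfloat16,
    "float16": torch.float16,
}[dtype_str]
ctx = (
    nullcontext()
    if dtype_str == "float32"
    else torch.amp.autocast(device_type="cuda", dtype=ptdtype)
)
```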
 
- data loader
  - create dataset
    - If there is a data downloading action, remember to barrier the other processes; we only need the master process of each node (`ddp_local_rank == 0`) to download the dataset (see the sketch below)
    - split the dataset, if using Huggingface's `datasets`
      - The split sub-datasets may have different lengths. This may be problematic for an epoch-based trainer, while for a step-based trainer it is fine.
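A sketch of the download-once-per-node pattern; `download_dataset` and `load_cached_dataset` are hypothetical helpers standing in for whatever fetch/cache logic the project uses.

```python
import torch.distributed as dist

if ddp_local_rank == 0:
    download_dataset()           # hypothetical: fetch and cache the raw files
dist.barrier()                   # every other rank waits for the download to finish
dataset = load_cached_dataset()  # hypothetical: now safe to read on every rank
```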
  - create dataloader (by default, `DataLoader` uses `RandomSampler`)
    - The yield of the dataloader is a list of samples if `collate_fn=lambda x: x`
    - If you do not use Huggingface's `dataset`, you need to use `DistributedSampler` in the `DataLoader` (see the sketch below)
    - the difference between `DistributedSampler` and `RandomSampler`:
      - `DistributedSampler` (https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html#DistributedSampler) generates a `torch.Generator` every time it is iterated. Thus it needs `set_epoch` to change the seed value each epoch.
        - it is a compromise to the design which splits the data inside the sampler, so that the indices of all the data are shuffled globally.
        - It needs a base seed which is the same for all the processes.
      - `RandomSampler` (https://pytorch.org/docs/stable/_modules/torch/utils/data/sampler.html#RandomSampler) gets a generator when it is initialized; it is not changed every time it is iterated.
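A sketch of the non-HuggingFace path with `DistributedSampler`; the batch size, `num_epochs`, and `base_seed` are assumed values.

```python
from torch.utils.data import DataLoader, DistributedSampler

# same base seed on every rank; the sampler derives per-epoch shuffles from it
sampler = DistributedSampler(dataset, shuffle=True, seed=base_seed)
loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                    num_workers=4, pin_memory=True)

for epoch in range(num_epochs):
    sampler.set_epoch(epoch)    # otherwise every epoch replays the same order
    for batch in loader:
        ...
```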
    - If you incorporate numpy randomness in the `Dataset`, set `worker_init_fn` to break the identical randomness across workers/processes (see the sketch below)
    - if you use `opencv` in the `Dataset`, set the thread number to one, or some functions will occupy all the threads
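A sketch of the `worker_init_fn` fix (the standard pattern from the PyTorch reproducibility notes): re-seed numpy inside each worker from `torch.initial_seed()`, which already differs per worker and, given the per-rank torch seed above, per process.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    worker_seed = torch.initial_seed() % 2**32  # distinct per worker (and per rank)
    np.random.seed(worker_seed)

loader = DataLoader(dataset, batch_size=32, sampler=sampler,
                    num_workers=4, worker_init_fn=seed_worker)
```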
  - a helper function to move data from CPU to GPU, from HuggingFace's `transformers` (see the sketch below)
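A minimal version of such a helper (a sketch, not the actual `transformers` code): recursively move tensors in nested dicts/lists/tuples to the target device.

```python
import torch

def move_to_device(data, device):
    if torch.is_tensor(data):
        return data.to(device, non_blocking=True)
    if isinstance(data, dict):
        return {k: move_to_device(v, device) for k, v in data.items()}
    if isinstance(data, (list, tuple)):
        return type(data)(move_to_device(v, device) for v in data)
    return data  # leave ints, strings, None, ... untouched
```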
- model
  - initialization
    - from scratch
    - from resume (see the sketch below)
      - note that in DDP, `map_location` must be specified, or the tensors would be loaded to the original device they were saved from, which may crash the VRAM; set `map_location="cpu"`
      - free the memory for the loaded file: `ckpt = None` or `del ckpt`
    - move to device
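A sketch of the resume path; `ckpt_path` and the checkpoint keys are assumptions about how the checkpoint was saved.

```python
import torch

ckpt = torch.load(ckpt_path, map_location="cpu")  # load to host RAM, not the saving GPU
model.load_state_dict(ckpt["model"])              # assumed key name
start_step = ckpt.get("step", 0)
del ckpt                                          # or ckpt = None; free the host memory
model.to(device)
```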
  - grad scaler for amp training if `dtype=float16`
  - optimizer (see the sketch below)
    - set `param.requires_grad = {True, False}` to determine the trainable parameters
    - gather `param_groups = List[param_dict]`
      - get all parameters
      - filter: keep `requires_grad == True`
      - decay: `p.dim() >= 2`; any parameter that is 2D will be weight decayed
    - (optional) try "fused" AdamW: `"fused" in inspect.signature(torch.optim.AdamW).parameters`
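A sketch covering the grad scaler, the param-group recipe, and the fused-AdamW probe above; the weight decay, lr, and betas are assumed config values.

```python
import inspect

import torch

# grad scaler: only actually scales when training in float16
scaler = torch.cuda.amp.GradScaler(enabled=(dtype_str == "float16"))

params = [p for p in model.parameters() if p.requires_grad]
decay_params = [p for p in params if p.dim() >= 2]   # weights of linear/conv/embedding layers
nodecay_params = [p for p in params if p.dim() < 2]  # biases, norm parameters, ...
param_groups = [
    {"params": decay_params, "weight_decay": 0.1},
    {"params": nodecay_params, "weight_decay": 0.0},
]

use_fused = "fused" in inspect.signature(torch.optim.AdamW).parameters
extra_args = dict(fused=True) if use_fused else dict()
optimizer = torch.optim.AdamW(param_groups, lr=3e-4, betas=(0.9, 0.95), **extra_args)
```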
  - (optional) compile model
  - wrap the model into the DDP container (see the sketch below)
    - get `raw_model` to save the ckpt w/o the `"module."` prefix
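A sketch of the compile/wrap step and the `raw_model` handle; the checkpoint contents shown in the comment are assumptions.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = torch.compile(model)                     # optional, PyTorch >= 2.0
model = DDP(model, device_ids=[ddp_local_rank])
raw_model = model.module                         # unwrapped module, no "module." prefix
# later, only on the master process, e.g.:
#   torch.save({"model": raw_model.state_dict(), "step": step},
#              os.path.join(out_dir, "ckpt.pt"))
```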
 
¶Training
(note about global variables: if a global is reassigned, rather than just mutated, inside a function, use `global` to declare it)
- model.train()
- adjust lr based on the step; change the lr manually: `for param_group in optimizer.param_groups: param_group["lr"] = lr`
- eval model and save weights
  - eval: distributed eval (check below)?
  - log: only the master process logs?
  - save weights: only the master process saves
- training step (see the sketch below)
  - gradient_accumulation_steps
  - in the model forward and loss computation, use the amp context
    - the loss is float32 because `mse_loss` layers `autocast` to float32; the output is float16 because float16-convertible layers `autocast` to float16
    - you don't need to manually change the inputs' `dtype` when enabling mixed precision
    - if using `gradient_accumulation_steps`, the loss needs to be divided by `gradient_accumulation_steps`
  - scale the loss, then backward the gradients
  - clip gradients: first `unscale_` the optimizer, then `clip_grad_norm_`
  - update the model w/ the scaler
    - to update the model: `scaler.step(optimizer)`
    - to update the scaler: `scaler.update()`
  - flush the gradients: `optimizer.zero_grad(set_to_none=True)`
  - log: training metrics and the time spent on model and data
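A sketch of one training step under the assumptions above (`ctx`, `scaler`, `optimizer` from the init section); `get_batch` is a hypothetical helper, and a model that returns `(logits, loss)` is an assumption.

```python
import torch

grad_clip = 1.0  # assumed config value

for micro_step in range(gradient_accumulation_steps):
    x, y = get_batch()                              # hypothetical data fetch
    with ctx:                                       # autocast context from the init section
        logits, loss = model(x, y)                  # assumed model interface
        loss = loss / gradient_accumulation_steps   # average over accumulated micro-batches
    scaler.scale(loss).backward()                   # scaling is a no-op for bf16/fp32

if grad_clip > 0.0:
    scaler.unscale_(optimizer)                      # unscale first so the threshold is in real units
    torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
scaler.step(optimizer)                              # skipped automatically on inf/nan gradients
scaler.update()
optimizer.zero_grad(set_to_none=True)
```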
 
 
¶Evaluation: tips about distributed evaluation
The results on each card might be different (if using amp/float16).
- model.eval()
- in the loss computation, sum up all the loss per mini-batch
  - if one averages the loss within a batch, then the loss needs to be multiplied back by the divisor
    - the divisor may not be the same as the batch size
      - e.g., the number of masks per sample, or the number of tokens per caption per sample
  - we also need to track and sum up the divisors
- reduce all the tracked and summed values: `dist.all_reduce` (see the sketch below)
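A sketch of the sum-then-`all_reduce` pattern above; `eval_loader` and `eval_step` (returning the per-batch summed loss and its divisor, e.g. the token/mask count) are hypothetical.

```python
import torch
import torch.distributed as dist

model.eval()
loss_sum = torch.zeros(1, device=device)
count_sum = torch.zeros(1, device=device)
with torch.no_grad():
    for batch in eval_loader:               # hypothetical eval dataloader
        loss, n_items = eval_step(batch)    # hypothetical: summed loss + its divisor
        loss_sum += loss
        count_sum += n_items

dist.all_reduce(loss_sum, op=dist.ReduceOp.SUM)    # sum across all ranks
dist.all_reduce(count_sum, op=dist.ReduceOp.SUM)
eval_loss = (loss_sum / count_sum).item()
```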