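(I) gradient_accumulation_steps
In model training, a larger batch_size usually gives a less noisy gradient estimate and often a better model. In some environments, however, there is not enough GPU memory to hold a large batch_size. In that case gradient_accumulation_steps can be used to get a similar effect.
Concretely, ordinary training updates the parameters after every batch. With gradient accumulation, the gradients of gradient_accumulation_steps consecutive batches are summed before a single update is made, which is equivalent to enlarging batch_size by a factor of gradient_accumulation_steps. The update itself is performed by optimizer.step().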
    import time

    # This function implements both warmup and lr_decay.
    def warmup_linear(x, warmup=0.002):
        if x < warmup:
            return x / warmup
        return 1.0 - x

    # Number of optimizer updates already performed before start_epoch.
    global_step_th = int(len(train_examples) / batch_size / gradient_accumulation_steps * start_epoch)
    for epoch in range(start_epoch, total_train_epoch):
        train_loss = 0
        train_start = time.time()
        model.train()
        optimizer.zero_grad()
        for step, batch in enumerate(train_dataloader):
            # Create the input and output for the batch and compute object_loss.
            # The two lines below are an assumed placeholder; criterion is hypothetical.
            inputs, labels = batch
            object_loss = criterion(model(inputs), labels)
            # Scale the loss so the accumulated gradient is an average over the accumulated batches.
            if gradient_accumulation_steps > 1:
                object_loss = object_loss / gradient_accumulation_steps
            # Backpropagation: gradients are added to the existing .grad values.
            object_loss.backward()
            train_loss = train_loss + object_loss.item()
            if (step + 1) % gradient_accumulation_steps == 0:
                # Compute the learning rate with the warmup used by BERT.
                lr_this_step = learning_rate * warmup_linear(global_step_th / total_train_steps)
                # Write the new learning rate into every parameter group of the optimizer.
                for param_group in optimizer.param_groups:
                    param_group['lr'] = lr_this_step
                optimizer.step()
                optimizer.zero_grad()
                global_step_th = global_step_th + 1
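The claim that accumulation is equivalent to a larger batch can be checked on a toy problem. The following is a minimal sketch, assuming a mean-reduced loss and equally sized micro-batches; the data, the Linear model, and the helper names grads_full_batch and grads_accumulated are made up for illustration.

```python
import torch

torch.manual_seed(0)

# Toy data: 8 samples with 4 features (hypothetical, for illustration only).
inputs = torch.randn(8, 4)
targets = torch.randn(8, 1)
criterion = torch.nn.MSELoss()  # mean reduction over the batch


def grads_full_batch():
    """Gradient of the loss computed on all 8 samples at once."""
    torch.manual_seed(1)  # same initial weights in both helpers
    model = torch.nn.Linear(4, 1)
    criterion(model(inputs), targets).backward()
    return model.weight.grad.clone()


def grads_accumulated(gradient_accumulation_steps=4):
    """Gradient accumulated over 4 micro-batches of 2 samples each."""
    torch.manual_seed(1)
    model = torch.nn.Linear(4, 1)
    micro_batches = zip(inputs.chunk(gradient_accumulation_steps),
                        targets.chunk(gradient_accumulation_steps))
    for x, y in micro_batches:
        # Dividing by the number of steps turns the sum of micro-batch
        # means into the mean over the whole batch.
        loss = criterion(model(x), y) / gradient_accumulation_steps
        loss.backward()  # adds into model.weight.grad
    return model.weight.grad.clone()


full = grads_full_batch()
accumulated = grads_accumulated()
print(torch.allclose(full, accumulated, atol=1e-6))
```

The division by gradient_accumulation_steps is what makes the two gradients agree. Without it the accumulated gradient would be gradient_accumulation_steps times larger.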
References:
https://blog.csdn.net/Princeicon/article/details/108058822
https://cowarder.site/2019/10/29/Gradient-Accumulation/
https://www.cnblogs.com/lart/p/11628696.html
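(II) Adjusting the learning rate (learning rate warmup and decay)
(1) The learning rate is usually adjusted in two phases: warmup first, then decay.
(2) The learning rate can be adjusted manually or with library functions. Apart from how they are set up, the difference is that a library scheduler does not require you to overwrite the learning rate explicitly, whereas manual adjustment has to write the new value into every parameter group: for param_group in optimizer.param_groups: param_group['lr'] = updated_lr.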
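To make the difference between the two approaches concrete, here is a small sketch that applies the same schedule once by hand and once through torch.optim.lr_scheduler.LambdaLR. The schedule halve_every_10 and the helper make_optimizer are assumptions chosen for illustration, not something from the notes above.

```python
import torch


def make_optimizer():
    # A single dummy parameter is enough to build an optimizer.
    param = torch.nn.Parameter(torch.zeros(1))
    return torch.optim.SGD([param], lr=0.1)


def halve_every_10(step):
    """Multiplier applied to the initial learning rate at a given step."""
    return 0.5 ** (step // 10)


# Manual adjustment: write the new value into every parameter group.
manual_opt = make_optimizer()
manual_lrs = []
for step in range(30):
    for param_group in manual_opt.param_groups:
        param_group['lr'] = 0.1 * halve_every_10(step)
    manual_lrs.append(manual_opt.param_groups[0]['lr'])
    manual_opt.step()

# Library adjustment: the scheduler rewrites param_groups for us.
sched_opt = make_optimizer()
scheduler = torch.optim.lr_scheduler.LambdaLR(sched_opt, lr_lambda=halve_every_10)
sched_lrs = []
for step in range(30):
    sched_lrs.append(sched_opt.param_groups[0]['lr'])
    sched_opt.step()
    scheduler.step()

print(manual_lrs[0], manual_lrs[10], manual_lrs[20])
```

Both loops see the same learning rate at every step; the scheduler only hides the loop over optimizer.param_groups.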
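① Warmup (learning rate warmup)
At the start of training the model weights are randomly initialized, so a large learning rate can make training unstable (oscillating). Warmup keeps the learning rate small for the first few epochs or steps, which lets the model settle gradually. Once it is reasonably stable, training switches to the preset learning rate, which tends to speed up convergence and improve the final result.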
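The warmup idea can be written as a plain function of the step index. This is only a sketch of linear warmup; the function name warmup_lr and the values base_lr=0.1 and warmup_steps=5 are assumptions for illustration.

```python
def warmup_lr(step, base_lr=0.1, warmup_steps=5):
    """Ramp the learning rate linearly up to base_lr, then hold it there."""
    if step < warmup_steps:
        # Steps 0 .. warmup_steps-1 use a growing fraction of base_lr.
        return base_lr * (step + 1) / warmup_steps
    return base_lr


schedule = [warmup_lr(step) for step in range(8)]
for step, lr in enumerate(schedule):
    print(step, round(lr, 4))
```

The learning rate never exceeds base_lr and is non-decreasing, so the first updates on the randomly initialized weights are the smallest ones.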
References:
https://blog.csdn.net/Xiao_CangTian/article/details/109269555
https://blog.csdn.net/sinat_36618660/article/details/99650804
https://blog.csdn.net/shanglianlm/article/details/85143614
https://blog.csdn.net/Guo_Python/article/details/106019396
https://zhuanlan.zhihu.com/p/136183319
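② lr_decay (learning rate decay)
When warmup ends, the learning rate has reached its target value. If training then continues at this large learning rate indefinitely, the loss may keep oscillating instead of settling. The learning rate is therefore lowered step by step as training proceeds.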
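Two common shapes of decay, written as plain functions of the step index. This is a sketch under assumed hyperparameters (base_lr, gamma, step_size); the function names are hypothetical and the formulas are the usual exponential and step decay, not something specific to these notes.

```python
def exponential_decay(step, base_lr=0.1, gamma=0.9):
    """Multiply the learning rate by gamma at every step."""
    return base_lr * gamma ** step


def step_decay(step, base_lr=0.1, step_size=10, gamma=0.5):
    """Multiply the learning rate by gamma once every step_size steps."""
    return base_lr * gamma ** (step // step_size)


exp_schedule = [exponential_decay(step) for step in range(30)]
step_schedule = [step_decay(step) for step in range(30)]
print(round(exp_schedule[0], 4), round(exp_schedule[29], 4))
print(step_schedule[0], step_schedule[10], step_schedule[20])
```

Exponential decay shrinks the learning rate a little at every step, while step decay keeps it constant between drops.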
References:
https://cloud.tencent.com/developer/article/1539729
https://blog.csdn.net/qq_40367479/article/details/82530324
https://blog.csdn.net/dou3516/article/details/105329103
General references:
https://zhuanlan.zhihu.com/p/392350994
https://www.cnblogs.com/wuliytTaotao/p/11101652.html
Manual learning rate adjustment (the changed learning rate has to be written into the optimizer by hand):
EXAMPLE.1 (this example also includes gradient accumulation)
    import time

    # This function implements both warmup and lr_decay.
    def warmup_linear(x, warmup=0.002):
        if x < warmup:
            return x / warmup
        return 1.0 - x

    # Number of optimizer updates already performed before start_epoch.
    global_step_th = int(len(train_examples) / batch_size / gradient_accumulation_steps * start_epoch)
    for epoch in range(start_epoch, total_train_epoch):
        train_loss = 0
        train_start = time.time()
        model.train()
        optimizer.zero_grad()
        for step, batch in enumerate(train_dataloader):
            # Create the input and output for the batch and compute object_loss.
            # The two lines below are an assumed placeholder; criterion is hypothetical.
            inputs, labels = batch
            object_loss = criterion(model(inputs), labels)
            # Scale the loss so the accumulated gradient is an average over the accumulated batches.
            if gradient_accumulation_steps > 1:
                object_loss = object_loss / gradient_accumulation_steps
            # Backpropagation
            object_loss.backward()
            train_loss = train_loss + object_loss.item()
            if (step + 1) % gradient_accumulation_steps == 0:
                # Compute the learning rate with the warmup used by BERT.
                lr_this_step = learning_rate * warmup_linear(global_step_th / total_train_steps)
                # Write the new learning rate into every parameter group of the optimizer.
                for param_group in optimizer.param_groups:
                    param_group['lr'] = lr_this_step
                optimizer.step()
                optimizer.zero_grad()
                global_step_th = global_step_th + 1
EXAMPLE.2 Gradual warmup as proposed by Facebook
    import numpy as np

    warmup_steps = 2500
    init_lr = 0.1
    # Simulate 15000 training steps.
    max_steps = 15000
    for train_steps in range(max_steps):
        # Warmup phase
        if warmup_steps and train_steps < warmup_steps:
            warmup_percent_done = train_steps / warmup_steps
            warmup_learning_rate = init_lr * warmup_percent_done  # gradual warmup_lr
            learning_rate = warmup_learning_rate
        # lr_decay phase
        else:
            # learning_rate = np.sin(learning_rate)  # after warmup, decay the learning rate with sin
            learning_rate = learning_rate ** 1.0001  # after warmup, decay the learning rate slowly (a rough imitation of exponential decay; it shrinks only because learning_rate < 1)
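Adjustment with library functions:
Common library schedulers for the learning rate:
    torch.optim.lr_scheduler.StepLR(optimizer, step_size, gamma=0.1, last_epoch=-1)
    torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones, gamma=0.1, last_epoch=-1)
    torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma, last_epoch=-1)
    torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max, eta_min=0, last_epoch=-1)
    torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=10, verbose=False, threshold=0.0001, threshold_mode='rel', cooldown=0, min_lr=0, eps=1e-08)
    torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda, last_epoch=-1)
EXAMPLE.1 (this code implements lr_decay only)
    import torch

    model = torch.nn.Linear(10, 2)  # placeholder model, assumed for illustration
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)  # initial learning rate 0.1
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)  # halve the learning rate every 10 scheduler steps
    arr = []
    for i in range(1, 100):
        optimizer.step()  # the training step goes here; call it before scheduler.step()
        scheduler.step()  # advance the scheduler's step counter by 1
        arr.append(optimizer.state_dict()['param_groups'][0]['lr'])
        # arr.append(scheduler.get_last_lr()[0])  # same purpose as the line above: read the current learning rate
EXAMPLE.2 (a custom schedule with LambdaLR)
    # LambdaLR sets lr = initial_lr * lambdaf(epoch). Note that this lambdaf grows
    # with epoch, so the multiplier rises from 0.05; for lr_decay the function
    # must decrease with epoch instead.
    lambdaf = lambda epoch: 0.05 + epoch / 100
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambdaf)
    for i in range(1, 100):
        optimizer.step()
        scheduler.step()  # advance the step counter by 1 and pass it to lambdaf
EXAMPLE.3 (this code implements warmup only)
    def warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor):
        def f(x):  # x is the number of scheduler steps taken so far
            if x >= warmup_iters:
                return 1
            alpha = float(x) / warmup_iters  # warmup progress, from 0 to 1
            return warmup_factor * (1 - alpha) + alpha
        return torch.optim.lr_scheduler.LambdaLR(optimizer, f)

    if warmup:  # warmup is a flag assumed to be defined by the caller
        warmup_factor = 1. / 1000
        warmup_iters = min(1000, len(train_loader) - 1)
        lr_scheduler = warmup_lr_scheduler(optimizer, warmup_iters, warmup_factor)
        for i in range(1, 100):
            optimizer.step()
            lr_scheduler.step()  # advance the step counter by 1 and pass it to f
EXAMPLE.4 (this code implements warmup and lr_decay)
    import torch.optim as optim
    from torch.optim.lr_scheduler import CosineAnnealingLR

    # WarmUpLR is not a class in torch.optim.lr_scheduler. It is assumed here to be a
    # user-defined scheduler subclass whose second argument is the total number of warmup iterations.
    optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9, weight_decay=5e-4)
    warmup_epoch = 5
    scheduler = CosineAnnealingLR(optimizer, 100 - warmup_epoch)
    warmup_scheduler = WarmUpLR(optimizer, iter_per_epoch * warmup_epoch)
    for epoch in range(1, max_epoch + 1):
        # lr_decay phase
        if epoch > warmup_epoch:
            scheduler.step()
            learn_rate = scheduler.get_last_lr()[0]
            print("Learn_rate:%s" % learn_rate)
        # warmup phase: the warmup length is given in iterations, so step once per iteration
        else:
            for _ in range(iter_per_epoch):
                warmup_scheduler.step()
            warm_lr = warmup_scheduler.get_last_lr()
            print("warm_lr:%s" % warm_lr)
EXAMPLE.5 (transformers.get_linear_schedule_with_warmup(), which implements warmup followed by a linear decrease of the learning rate)
    from torch.optim import AdamW
    from tqdm import trange
    from transformers import get_linear_schedule_with_warmup

    total_steps = len(train_dataloader) * num_epoch
    optimizer = AdamW(optimizer_grouped_parameters, lr=3e-5, eps=1e-8)
    lr_scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=int(warmup_rate * total_steps), num_training_steps=total_steps)
    for __ in trange(num_epoch, desc='Epoch'):
        model.train()
        total_loss = 0
        for batch_iter, batch_dataloader in enumerate(train_dataloader):
            # ...... (forward pass that produces outputs)
            loss = outputs.loss
            loss.backward()
            total_loss = total_loss + loss.cpu().item()  # item() returns the value of the tensor
            # update the parameters
            optimizer.step()
            # update the learning rate
            lr_scheduler.step()
            # clear the gradients so they do not leak into the next batch
            optimizer.zero_grad()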
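Note: the learning rate decay library function we use most often:
lr_decay = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer) (one of the learning rate decay methods)
Explanation: this is a class whose constructor takes many parameters. Only the key ones are described here:
factor: the factor the learning rate is multiplied by when it is reduced (new_lr = lr * factor);
patience: the number of consecutive epochs without improvement in the monitored metric (loss or accuracy) that are tolerated before the learning rate is reduced;
threshold: a change in the metric smaller than this value counts as no improvement;
min_lr: the smallest learning rate allowed; the learning rate is never reduced below this value;
eps: the minimum change in the learning rate. If the difference between the old and the new learning rate is smaller than this value, the update is skipped;
lr_decay.step(val_loss) or lr_decay.step(accuracy): the metric that drives the update. When the metric has not decreased (for a loss) or increased (for accuracy) by more than the threshold over the last few epochs, the learning rate is adjusted. For example, reduce the learning rate when the validation loss stops falling (mode='min'), or when the validation accuracy stops rising (mode='max').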
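The parameters above can be seen in action with a short simulation. This is a minimal sketch: the validation losses are invented, the single dummy parameter stands in for a real model, and factor=0.5, patience=2, min_lr=0.01 are arbitrary choices for illustration.

```python
import torch

# A single dummy parameter is enough to build an optimizer.
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
lr_decay = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=2, min_lr=0.01)

# Invented validation losses: two improvements, then a plateau.
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.8]
history = []
for epoch, val_loss in enumerate(val_losses):
    optimizer.step()          # the training epoch would happen here
    lr_decay.step(val_loss)   # pass the monitored metric to the scheduler
    history.append(optimizer.param_groups[0]['lr'])
    print(epoch, val_loss, history[-1])
```

With patience=2 the scheduler tolerates two stalled epochs and reduces the learning rate on the third, then starts counting again from zero.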
References:
https://blog.csdn.net/weixin_40100431/article/details/84311430
https://zhuanlan.zhihu.com/p/69411064
https://www.jianshu.com/p/26a7dbc15246
https://www.cnblogs.com/xym4869/p/11654611.html
https://blog.csdn.net/qyhaill/article/details/103043637
https://blog.csdn.net/emperinter/article/details/108917935
https://www.emperinter.info/2020/08/05/change-leaning-rate-by-reducelronplateau-in-pytorch/
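(III) optimizer.zero_grad()
During backpropagation, gradients are accumulated into each parameter's .grad rather than overwriting it. Unless gradient accumulation is intended, the gradient of one batch should not be mixed with those of other batches, so optimizer.zero_grad() has to be called once per batch to reset the gradients of the parameters held by the optimizer. After that, loss.backward() backpropagates to compute the gradients, and finally optimizer.step() updates the parameters managed by the optimizer.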
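    # Reset the gradients of all parameters (weights).
    optimizer.zero_grad()
    # Compute the model output from the inputs.
    outputs = net(inputs)
    # Compute the loss.
    loss = criterion(outputs, labels)
    # Backpropagate the loss to compute the weight gradients.
    loss.backward()
    # Update the weights.
    optimizer.step()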
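The accumulation behaviour is easy to observe on a single parameter. A minimal sketch, assuming a scalar weight w and the toy loss 2 * w, whose gradient with respect to w is always 2:

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0]))
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes without a reset: the gradients add up.
(2 * w).sum().backward()
(2 * w).sum().backward()
grad_without_reset = w.grad.clone()

# Reset, then one backward pass: only the fresh gradient remains.
optimizer.zero_grad()
(2 * w).sum().backward()
grad_after_reset = w.grad.clone()

print(grad_without_reset.item(), grad_after_reset.item())
```

The first value is twice the second, which is exactly the mixing between batches that optimizer.zero_grad() prevents.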
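(IV) optimizer and scheduler
The optimizer updates the model parameters (the parameters registered with it) according to the chosen optimization algorithm and its hyperparameters such as lr and momentum. The update is performed by optimizer.step().
The scheduler updates and adjusts the learning rate stored in the optimizer. The update is performed by scheduler.step().
In general, within one batch (iteration), optimizer.step() is called first to update the weights, and scheduler.step() is called afterwards to update the learning rate.
Note: scheduler.step() is what changes the learning rate inside the optimizer instance. If it is never called, the scheduler does not adjust the learning rate at all, and the only other way to change it is to write to optimizer.param_groups by hand.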
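The call order described above can be sketched in a few lines. The Linear model, the random data, and the StepLR settings (step_size=2, gamma=0.1) are assumptions chosen only so that the effect is visible within four iterations.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)

inputs, labels = torch.randn(4, 2), torch.randn(4, 1)
lrs_used = []
for iteration in range(4):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(inputs), labels)
    loss.backward()
    # Record the learning rate that this update is about to use.
    lrs_used.append(optimizer.param_groups[0]['lr'])
    optimizer.step()   # 1) update the weights with the current learning rate
    scheduler.step()   # 2) then let the scheduler set the next learning rate

final_lr = optimizer.param_groups[0]['lr']
print(lrs_used, final_lr)
```

Because scheduler.step() comes second, the very first update uses the initial learning rate, and each change made by the scheduler takes effect from the following iteration.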
References:
https://zhuanlan.zhihu.com/p/367999849
https://blog.csdn.net/qq_40178291/article/details/99963586
https://www.jianshu.com/p/1db8581edd4f
A good example of resuming the optimizer's learning rate: https://zhuanlan.zhihu.com/p/136902153