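I'm sure many of you know Hugging Face and have used the pretrained language models in its Transformers library. But have you ever felt that training is a bit too slow? This post walks through, step by step, how to cut training time nearly in half.
Training BERT
First, install the Transformers library, which is straightforward: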
- pip install transformers
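Next, copy the official example script as-is. We use the GLUE task here; the script is at https://github.com/huggingface/transformers/blob/master/examples/pytorch/text-classification/run_glue.py. The code is too long to reproduce here; after copying it, save it as run_glue.py.
Now we can run the script directly. We use the MRPC dataset with FP16 training enabled: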
- python run_glue.py \
- --model_name_or_path bert-base-cased \
- --task_name mrpc \
- --do_train \
- --do_eval \
- --max_seq_length 128 \
- --per_device_train_batch_size 32 \
- --num_train_epochs 3 \
- --output_dir /tmp/mrpc/ \
- --overwrite_output_dir \
- --fp16
I trained on a single GPU. The output after training is:
- ***** train metrics *****
- epoch = 3.0
- train_loss = 0.3921
- train_runtime = 0:00:45.06
- train_samples = 3668
- train_samples_per_second = 244.166
- train_steps_per_second = 7.655
As you can see, training took 45 seconds in total. Getting impatient yet?
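As a sanity check on how these numbers relate to each other, the two throughput figures can be recovered from the runtime. The sketch below assumes that samples per second is (samples x epochs) / runtime and that steps per second is total optimizer steps / runtime, with a single GPU and no gradient accumulation; the helper name throughput is made up for illustration.

```python
import math


def throughput(train_samples, epochs, batch_size, runtime_s):
    """Return (samples/s, steps/s), assuming one GPU and no gradient accumulation."""
    # The last batch of each epoch may be partial, hence the ceiling.
    steps_per_epoch = math.ceil(train_samples / batch_size)
    total_steps = steps_per_epoch * epochs
    return train_samples * epochs / runtime_s, total_steps / runtime_s


# Values taken from the baseline run above.
samples_per_s, steps_per_s = throughput(3668, 3, 32, 45.06)
print(f"{samples_per_s:.1f} samples/s, {steps_per_s:.2f} steps/s")
```

The results land close to the reported 244.166 and 7.655; the small gap comes from the runtime being rounded to two decimals in the log.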
Speeding up training
First, install a training acceleration library. We use LightSeq here; the project lives at https://github.com/bytedance/lightseq. As before, we simply install it with pip:
- pip install lightseq
Then all that remains is to replace Hugging Face's BERT encoder layers with LightSeq's. The code below goes in the file replace_module.py.
- from lightseq.training.ops.pytorch.transformer_encoder_layer import (
-     LSTransformerEncoderLayer,
- )
- class LSHFTransformerEncoderLayer(LSTransformerEncoderLayer):
-     def __init__(self, *args, **kwargs):
-         super(LSHFTransformerEncoderLayer, self).__init__(*args, **kwargs)
-     def forward(self, hidden_states, encoder_padding_mask, *args, **kwargs):
-         # Hugging Face passes an additive mask (0 = token, -10000 = padding);
-         # dividing by -10000 turns it into a 0/1 mask (1 = padding).
-         # Divide out of place: the same mask tensor is passed to every layer,
-         # so an in-place "/=" would corrupt it for the following layers.
-         # Depending on the LightSeq version, the mask may also need reshaping
-         # from [batch, 1, 1, seq_len] to [batch, seq_len].
-         ls_padding_mask = encoder_padding_mask / -10000.0
-         output = super().forward(hidden_states, ls_padding_mask)
-         # Return a tuple, as the Hugging Face encoder expects from a layer.
-         return (output, None, None, None)
- def gen_ls_bert_config(training_args, config):
-     bert_config = LSTransformerEncoderLayer.get_config(
-         max_batch_tokens=4096,
-         max_seq_len=config.max_position_embeddings,
-         hidden_size=config.hidden_size,
-         intermediate_size=config.intermediate_size,
-         nhead=config.num_attention_heads,
-         attn_prob_dropout_ratio=config.attention_probs_dropout_prob,
-         activation_dropout_ratio=0.1,
-         hidden_dropout_ratio=config.hidden_dropout_prob,
-         pre_layer_norm=False,
-         fp16=training_args.fp16,
-         local_rank=training_args.local_rank,
-     )
-     return bert_config
- def inject_ls_enc_layer(model, training_args, config):
-     for i in range(config.num_hidden_layers):
-         bert_config = gen_ls_bert_config(training_args, config)
-         model.bert.encoder.layer[i] = LSHFTransformerEncoderLayer(bert_config)
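Here LSHFTransformerEncoderLayer inherits from LightSeq's LSTransformerEncoderLayer class and overrides forward. The reason is that Hugging Face's input format differs slightly from LightSeq's, so the attention mask has to be converted before the parent forward is called.
gen_ls_bert_config defines the LightSeq encoder configuration; its values are read directly from the training_args and config objects available in the Hugging Face main function.
inject_ls_enc_layer replaces every encoder layer in BERT: it first builds the configuration for the layer, then swaps the original encoder layer for an LSHFTransformerEncoderLayer instance.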
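To make the mask conversion concrete: the division by -10000.0 in forward implies that each layer receives an additive attention mask (0 for real tokens, -10000 for padding) and turns it into a 0/1 padding mask. The toy sketch below uses plain Python lists in place of tensors (an assumption made so it runs without PyTorch), and both helper names are hypothetical. It shows the difference between converting out of place and in place when the same mask object is handed to several layers in turn.

```python
def to_padding_mask(additive_mask, neg=-10000.0):
    """Map an additive mask (0 = token, neg = padding) to 0/1 (1 = padding).

    Returns a new list; the input is left untouched.
    """
    return [v / neg for v in additive_mask]


def to_padding_mask_inplace(additive_mask, neg=-10000.0):
    """Same conversion, but overwrites the caller's list."""
    for i, v in enumerate(additive_mask):
        additive_mask[i] = v / neg
    return additive_mask


# Out of place: every "layer" sees the same, correct 0/1 mask.
shared = [0.0, 0.0, -10000.0]
layer1 = to_padding_mask(shared)
layer2 = to_padding_mask(shared)

# In place: the second "layer" divides an already converted mask again.
shared_bad = [0.0, 0.0, -10000.0]
bad1 = list(to_padding_mask_inplace(shared_bad))
bad2 = list(to_padding_mask_inplace(shared_bad))

print("out of place:", layer1, layer2)
print("in place:    ", bad1, bad2)
```

With real tensors the same aliasing applies, which is why an out-of-place division is the safer choice inside forward.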
Next, open run_glue.py and import inject_ls_enc_layer at the top of the file:
- from replace_module import inject_ls_enc_layer
Finally, right after the model is defined, replace its encoder layers using the function imported above:
- model = AutoModelForSequenceClassification.from_pretrained(
-     model_args.model_name_or_path,
-     from_tf=bool(".ckpt" in model_args.model_name_or_path),
-     config=config,
-     cache_dir=model_args.cache_dir,
-     revision=model_args.model_revision,
-     use_auth_token=True if model_args.use_auth_token else None,
- )
- # Replace the encoder layers immediately after the model is defined
- inject_ls_enc_layer(model, training_args, config)
Rerun the same command as before:
- python run_glue.py \
- --model_name_or_path bert-base-cased \
- --task_name mrpc \
- --do_train \
- --do_eval \
- --max_seq_length 128 \
- --per_device_train_batch_size 32 \
- --num_train_epochs 3 \
- --output_dir /tmp/mrpc/ \
- --overwrite_output_dir \
- --fp16
The final output is:
- ***** train metrics *****
- epoch = 3.0
- train_loss = 0.6077
- train_runtime = 0:00:25.08
- train_samples = 3668
- train_samples_per_second = 438.603
- train_steps_per_second = 13.751
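This run took only 25 seconds! Truly worthy of the title "the fastest man at ByteDance."
Loading the pretrained parameters
Sharp-eyed readers may have noticed that the results got worse after the speedup: the training loss rose from 0.3921 to 0.6077. That's right. The newly created encoder layers have randomly initialized parameters, so the pretrained parameters need to be loaded again.
LightSeq's encoder class accepts pretrained parameters at initialization, so we only need to extract them from Hugging Face's BERT: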
- def get_hf_bert_enc_layer_params(layer):
-     init_ws = []
-     init_bs = []
-     init_ws.append(layer.attention.self.query.weight.detach().clone())
-     init_bs.append(layer.attention.self.query.bias.detach().clone())
-     init_ws.append(layer.attention.self.key.weight.detach().clone())
-     init_bs.append(layer.attention.self.key.bias.detach().clone())
-     init_ws.append(layer.attention.self.value.weight.detach().clone())
-     init_bs.append(layer.attention.self.value.bias.detach().clone())
-     init_ws.append(layer.attention.output.dense.weight.detach().clone())
-     init_bs.append(layer.attention.output.dense.bias.detach().clone())
-     init_ws.append(layer.attention.output.LayerNorm.weight.detach().clone())
-     init_bs.append(layer.attention.output.LayerNorm.bias.detach().clone())
-     init_ws.append(layer.intermediate.dense.weight.detach().clone())
-     init_bs.append(layer.intermediate.dense.bias.detach().clone())
-     init_ws.append(layer.output.dense.weight.detach().clone())
-     init_bs.append(layer.output.dense.bias.detach().clone())
-     init_ws.append(layer.output.LayerNorm.weight.detach().clone())
-     init_bs.append(layer.output.LayerNorm.bias.detach().clone())
-     return init_ws, init_bs
Note that the parameters must appear in the lists in exactly this order. Then pass the two lists to the LSHFTransformerEncoderLayer constructor:
- def inject_ls_enc_layer(model, training_args, config):
-     for i in range(config.num_hidden_layers):
-         bert_config = gen_ls_bert_config(training_args, config)
-         # Extract the pretrained parameters from the original layer
-         init_ws, init_bs = get_hf_bert_enc_layer_params(model.bert.encoder.layer[i])
-         # Initialize the LightSeq layer with the pretrained parameters
-         model.bert.encoder.layer[i] = LSHFTransformerEncoderLayer(
-             bert_config, init_ws, init_bs
-         )
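Because the order of the two lists matters, one way to make it explicit is to keep the submodule paths in a single ordered tuple and walk it. This is a sketch of an alternative to the hand-written extraction function, not a helper that LightSeq provides; HF_BERT_LAYER_PATHS and collect_params are names chosen here, and the copy argument defaults to detach().clone() as in the original function.

```python
from operator import attrgetter

# Submodules of one Hugging Face BERT encoder layer, in the same order as the
# hand-written extraction function in this post.
HF_BERT_LAYER_PATHS = (
    "attention.self.query",
    "attention.self.key",
    "attention.self.value",
    "attention.output.dense",
    "attention.output.LayerNorm",
    "intermediate.dense",
    "output.dense",
    "output.LayerNorm",
)


def collect_params(layer, copy=lambda t: t.detach().clone()):
    """Return (weights, biases) for one encoder layer, in the order above."""
    init_ws = [copy(attrgetter(p + ".weight")(layer)) for p in HF_BERT_LAYER_PATHS]
    init_bs = [copy(attrgetter(p + ".bias")(layer)) for p in HF_BERT_LAYER_PATHS]
    return init_ws, init_bs
```

Inside inject_ls_enc_layer, a call such as collect_params(model.bert.encoder.layer[i]) would then stand in for get_hf_bert_enc_layer_params, with the ordering visible in one place.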
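Run the same command again, and the results come back up.
How does it compare with the competition?
You may also have heard of DeepSpeed, another well-known training acceleration library. How does LightSeq's speed compare with it?
Hugging Face Transformers already integrates DeepSpeed, so it can be switched on directly. However, that integration does not replace the encoder, so the model still runs as ordinary PyTorch code and remains slow. We therefore need to swap in DeepSpeed's encoder layer by hand.
The code is similar to the above: define the configuration, then inject the encoder layer: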
- from deepspeed.ops.transformer import (
-     DeepSpeedTransformerConfig,
-     DeepSpeedTransformerLayer,
- )
- def gen_ds_bert_config(training_args, config):
-     bert_config = DeepSpeedTransformerConfig(
-         batch_size=4096,
-         hidden_size=config.hidden_size,
-         intermediate_size=config.intermediate_size,
-         heads=config.num_attention_heads,
-         attn_dropout_ratio=config.attention_probs_dropout_prob,
-         hidden_dropout_ratio=config.hidden_dropout_prob,
-         num_hidden_layers=config.num_hidden_layers,
-         initializer_range=0.02,
-         layer_norm_eps=1e-8,
-         local_rank=training_args.local_rank,
-         fp16=training_args.fp16,
-         pre_layer_norm=False,
-         huggingface=True,
-         training=True,
-     )
-     return bert_config
- def inject_ds_enc_layer(model, training_args, config):
-     for i in range(config.num_hidden_layers):
-         bert_config = gen_ds_bert_config(training_args, config)
-         model.bert.encoder.layer[i] = DeepSpeedTransformerLayer(bert_config)
Then, in run_glue.py, import the inject_ds_enc_layer function and apply it to the model:
- from replace_module import inject_ds_enc_layer
- model = AutoModelForSequenceClassification.from_pretrained(
-     model_args.model_name_or_path,
-     from_tf=bool(".ckpt" in model_args.model_name_or_path),
-     config=config,
-     cache_dir=model_args.cache_dir,
-     revision=model_args.model_revision,
-     use_auth_token=True if model_args.use_auth_token else None,
- )
- # Replace the encoder layers immediately after the model is defined
- inject_ds_enc_layer(model, training_args, config)
Finally, DeepSpeed needs a runtime configuration file, ds_config.json:
- {
-     "train_micro_batch_size_per_gpu": "auto",
-     "optimizer": {
-         "type": "AdamW",
-         "params": {
-             "lr": "auto",
-             "betas": [
-                 0.9,
-                 0.999
-             ],
-             "eps": 1e-8,
-             "weight_decay": "auto",
-             "torch_adam": true
-         }
-     },
-     "scheduler": {
-         "type": "WarmupDecayLR",
-         "params": {
-             "warmup_num_steps": "auto",
-             "warmup_min_lr": "auto",
-             "warmup_max_lr": "auto",
-             "total_num_steps": "auto"
-         }
-     },
-     "gradient_clipping": "auto",
-     "fp16": {
-         "enabled": "auto",
-         "loss_scale": 0,
-         "initial_scale_power": 7
-     }
- }
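The "auto" entries are placeholders that the Hugging Face Trainer fills in from the command-line arguments (batch size, learning rate, and so on), so the same file can be reused across runs. As an optional convenience, the sketch below builds the same configuration as a Python dict and writes it out, which catches JSON syntax slips early. The output path is simply the ds_config.json name used in this post; nothing here is required by DeepSpeed.

```python
import json

# Same settings as ds_config.json in this post; "auto" values are left for
# the Hugging Face Trainer to fill in from the command-line arguments.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": [0.9, 0.999],
            "eps": 1e-8,
            "weight_decay": "auto",
            "torch_adam": True,
        },
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_num_steps": "auto",
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "total_num_steps": "auto",
        },
    },
    "gradient_clipping": "auto",
    "fp16": {"enabled": "auto", "loss_scale": 0, "initial_scale_power": 7},
}

# Serialize once, then write the text to disk.
config_text = json.dumps(ds_config, indent=2)
with open("ds_config.json", "w") as f:
    f.write(config_text + "\n")
```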
The command changes slightly to use the DeepSpeed launcher:
- deepspeed --num_gpus=1 run_glue.py \
- --model_name_or_path bert-base-cased \
- --task_name mrpc \
- --do_train \
- --do_eval \
- --max_seq_length 128 \
- --per_device_train_batch_size 32 \
- --num_train_epochs 3 \
- --output_dir /tmp/mrpc/ \
- --overwrite_output_dir \
- --fp16 \
- --deepspeed ds_config.json
The output is:
- ***** train metrics *****
- epoch = 3.0
- train_loss = 0.5865
- train_runtime = 0:00:37.17
- train_samples = 3668
- train_samples_per_second = 296.032
- train_steps_per_second = 9.281
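Summary
In the final comparison, Hugging Face took 45 seconds to finish training, DeepSpeed took 37 seconds, and LightSeq took only 25 seconds, roughly a 1.8x speedup over the baseline. These timings measure speed only: the 25-second and 37-second runs shown above used encoder layers created without pretrained weights, which likely explains why their training losses (0.6077 and 0.5865) are higher than the baseline's 0.3921.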
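For reference, the relative speedups can be worked out from the three train_runtime values reported in this post. The sketch below is plain arithmetic on those logged numbers; the variable names are chosen here for illustration.

```python
# train_runtime values (in seconds) from the three runs in this post.
runtimes = {"Hugging Face": 45.06, "DeepSpeed": 37.17, "LightSeq": 25.08}

baseline = runtimes["Hugging Face"]
# Speedup relative to the unmodified Hugging Face run.
speedups = {name: baseline / t for name, t in runtimes.items()}
# Fraction of the baseline training time that each setup saves.
time_saved = {name: 1 - t / baseline for name, t in runtimes.items()}

for name in runtimes:
    print(f"{name}: {speedups[name]:.2f}x, {time_saved[name]:.0%} less time")
```

On these numbers LightSeq comes out at about 1.8x the baseline speed, so "nearly half the time" is the accurate way to put it, while DeepSpeed is about 1.2x.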
Project:
https://github.com/bytedance/lightseq
How it works:
https://zhuanlan.zhihu.com/p/383657837
More usage examples:
https://zhuanlan.zhihu.com/p/382961951