• [源码解析] 模型并行分布式训练Megatron (2) 整体架构


    [源码解析] 模型并行分布式训练Megatron (2) --- 整体架构

    0x00 摘要

    NVIDIA Megatron 是一个基于 PyTorch 的分布式训练框架,用来训练超大Transformer语言模型,其通过综合应用了数据并行,Tensor并行和Pipeline并行来复现 GPT3,值得我们深入分析其背后机理。

    本系列大概有6~7篇文章,通过论文和源码和大家一起学习研究。本文将对 Megatron 的基本架构做一下梳理。

    本系列其他文章为:

    [源码解析] 模型并行分布式训练Megatron (1) --- 论文 & 基础

    0x01 启动

    1.1 分布式启动

    启动脚本在 examples/pretrain_bert_distributed.sh,其利用了 torch.distributed.launch 来启动多个进程。具体业务代码是 pretrain_bert.py。

    因为 GPUS_PER_NODE 是8,所以 nproc_per_node 是8,这样,在本机上就启动了8个进程,每个进程之中含有模型的一部分进程的 rank 是被 torch.distributed.launch 调用 elastic 自动分配的

    #!/bin/bash
    
    GPUS_PER_NODE=8
    # Change for multinode config
    MASTER_ADDR=localhost
    MASTER_PORT=6000
    NNODES=1
    NODE_RANK=0
    WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
    
    DATA_PATH=<Specify path and file prefix>_text_sentence
    CHECKPOINT_PATH=<Specify path>
    
    DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
    
    python -m torch.distributed.launch $DISTRIBUTED_ARGS \
           pretrain_bert.py \
           --num-layers 24 \
           --hidden-size 1024 \
           --num-attention-heads 16 \
           --micro-batch-size 4 \
           --global-batch-size 32 \
           --seq-length 512 \
           --max-position-embeddings 512 \
           --train-iters 1000000 \
           --save $CHECKPOINT_PATH \
           --load $CHECKPOINT_PATH \
           --data-path $DATA_PATH \
           --vocab-file bert-vocab.txt \
           --data-impl mmap \
           --split 949,50,1 \
           --distributed-backend nccl \
           --lr 0.0001 \
           --lr-decay-style linear \
           --min-lr 1.0e-5 \
           --lr-decay-iters 990000 \
           --weight-decay 1e-2 \
           --clip-grad 1.0 \
           --lr-warmup-fraction .01 \
           --log-interval 100 \
           --save-interval 10000 \
           --eval-interval 1000 \
           --eval-iters 10 \
           --fp16
    

    1.2 构造基础

    pretrain_bert.py 会调用 pretrain 进行预训练。

    if __name__ == "__main__":
    
        pretrain(train_valid_test_datasets_provider, model_provider,
                 ModelType.encoder_or_decoder,
                 forward_step, args_defaults={'tokenizer_type': 'BertWordPieceLowerCase'})
    

    1.2.1 获取模型

    model_provider返回模型普通版本(vanilla version)。所谓vanilla,我们指的是一个简单的cpu模型,没有 fp16或 ddp,但是已经被 Megatron 改造为并行的版本。

    def model_provider(pre_process=True, post_process=True):
        """Build the model."""
    
        print_rank_0('building BERT model ...')
    
        args = get_args()
        num_tokentypes = 2 if args.bert_binary_head else 0
        model = BertModel(
            num_tokentypes=num_tokentypes,
            add_binary_head=args.bert_binary_head,
            parallel_output=True,
            pre_process=pre_process,
            post_process=post_process)
    
        return model
    

    1.2.2 获取数据集

    train_valid_test_datasets_provider 会接受train/valid/test数据集的大小,并返回 “train,valid,test” 数据集。

    def train_valid_test_datasets_provider(train_val_test_num_samples):
        """Build train, valid, and test datasets."""
        args = get_args()
    
        print_rank_0('> building train, validation, and test datasets '
                     'for BERT ...')
        train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
            data_prefix=args.data_path,
            data_impl=args.data_impl,
            splits_string=args.split,
            train_valid_test_num_samples=train_val_test_num_samples,
            max_seq_length=args.seq_length,
            masked_lm_prob=args.mask_prob,
            short_seq_prob=args.short_seq_prob,
            seed=args.seed,
            skip_warmup=(not args.mmap_warmup),
            binary_head=args.bert_binary_head)
        print_rank_0("> finished creating BERT datasets ...")
    
        return train_ds, valid_ds, test_ds
    

    1.2.3 步进函数

    forward_step函数接受一个“数据迭代器”和“模型”,并返回一个“loss”标量,该标量带有一个字典,其中key:value是希望在训练期间监视的信息,例如“lm loss:value”。还要求此函数将“batch generator”添加到timers类中。

    def forward_step(data_iterator, model):
        """Forward step."""
        args = get_args()
    
        # Get the batch.
        tokens, types, sentence_order, loss_mask, lm_labels, padding_mask = get_batch(
            data_iterator)
    
        if not args.bert_binary_head:
            types = None
    
        # Forward pass through the model.
        output_tensor = model(tokens, padding_mask, tokentype_ids=types,
                              lm_labels=lm_labels)
    
        return output_tensor, partial(loss_func, loss_mask, sentence_order)
    
    1.2.3.1 广播数据

    forward_step 会调用 get_batch 获取batch 数据,其内部会从迭代器获取数据,然后使用broadcast_data函数把输入数据从 rank 0 广播到所有tensor-model-parallel 其他 ranks之上。

    注意,数据并行是把不同数据加载到不同的rank之上,而 Tensor模型并行组之中每个rank都加载同样数据

    def get_batch(data_iterator):
        """Build the batch."""
    
        # Items and their type.
        keys = ['text', 'types', 'labels', 'is_random', 'loss_mask', 'padding_mask']
        datatype = torch.int64
    
        # Broadcast data.
        if data_iterator is not None:
            data = next(data_iterator) # 获取数据
        else:
            data = None
        data_b = mpu.broadcast_data(keys, data, datatype) # 把数据广播到各个GPU
    
        # Unpack.
        tokens = data_b['text'].long()
        types = data_b['types'].long()
        sentence_order = data_b['is_random'].long()
        loss_mask = data_b['loss_mask'].float()
        lm_labels = data_b['labels'].long()
        padding_mask = data_b['padding_mask'].long()
    
        return tokens, types, sentence_order, loss_mask, lm_labels, padding_mask
    

    broadcast_data 在每个model parallel group之上,把数据从rank 0发送到同组其他成员。

    def broadcast_data(keys, data, datatype):
        """Broadcast data from rank zero of each model parallel group to the
        members of the same model parallel group.
    
        Arguments:
            keys: list of keys in the data disctionary to be broadcasted
            data: data dictionary of string keys and cpu tensor values.
            datatype: torch data type of all tensors in data associated
                      with keys.
        """
        # Build (key, size) and (key, number of elements) dictionaries along
        # with the total number of elements on all ranks.
        key_size, key_numel, total_numel = _build_key_size_numel_dictionaries(keys,
                                                                              data)
    
        # Pack on rank zero.
        if get_tensor_model_parallel_rank() == 0: # rank 0才压缩
            # Check that all keys have the same data type.
            _check_data_types(keys, data, datatype)
            # Flatten the data associated with the keys
            flatten_data = torch.cat(
                [data[key].contiguous().view(-1) for key in keys], dim=0).cuda()
        else:
            flatten_data = torch.empty(total_numel,
                                       device=torch.cuda.current_device(),
                                       dtype=datatype)
    
        # Broadcast
        torch.distributed.broadcast(flatten_data, get_tensor_model_parallel_src_rank(),
                                    group=get_tensor_model_parallel_group())
    
        # Unpack
        output = {}
        offset = 0
        for key in keys:
            size = key_size[key]
            numel = key_numel[key]
            output[key] = flatten_data.narrow(0, offset, numel).view(size)
            offset += numel
    
        return output
    

    get_tensor_model_parallel_src_rank 计算与张量模型并行组中第一个local rank对应的全局rank。

    def get_tensor_model_parallel_src_rank():
        """Calculate the global rank corresponding to the first local rank
        in the tensor model parallel group."""
        global_rank = torch.distributed.get_rank()
        local_world_size = get_tensor_model_parallel_world_size()
        return (global_rank // local_world_size) * local_world_size
    

    逻辑图具体如下,三个不同的函数分别为预训练提供不同的功能输入,做到了解耦。

    0x02 Pretrain

    BERT训练主要分为两步:

    • Pre-train:pre-train是迁移学习的基础,是训练token-level的语义理解。
    • Fine-tuning:在已经训练好的语言模型基础之上,加入特定领域(比如金融医疗)的参数来重新训练,比如对于分类问题就可以在pre-train模型基础之上加上一个softmax,再使用语料 fine-tune。

    Pre-train 主要如下:

    • 初始化Megatron。

    • 使用model_provider设置模型、优化器和lr计划。

    • 调用train_val_test_data_provider以获取train/val/test数据集。

    • 使用forward_step_func训练模型。

    具体代码如下:

    def pretrain(train_valid_test_dataset_provider,
                 model_provider,
                 model_type,
                 forward_step_func,
                 extra_args_provider=None,
                 args_defaults={}):
        """Main training program.
    
        This function will run the followings in the order provided:
            1) initialize Megatron.
            2) setup model, optimizer and lr schedule using the model_provider.
            3) call train_val_test_data_provider to get train/val/test datasets.
            4) train the modle using the forward_step_func.
        """
    
        # Initalize and get arguments, timers, and Tensorboard writer.
        initialize_megatron(extra_args_provider=extra_args_provider,
                            args_defaults=args_defaults)
    
        # Adjust the startup time so it reflects the largest value.
        # This will be closer to what scheduler will see (outside of
        # image ... launches.
        global _TRAIN_START_TIME
        start_time_tensor = torch.cuda.DoubleTensor([_TRAIN_START_TIME])
        torch.distributed.all_reduce(start_time_tensor,
                                     op=torch.distributed.ReduceOp.MIN)
        _TRAIN_START_TIME = start_time_tensor.item()
    
        args = get_args()
        timers = get_timers()
    
        # Model, optimizer, and learning rate. 使用model_provider设置模型、优化器和lr计划
        model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider,
                                                                   model_type)
    
        # Data stuff. 调用train_val_test_data_provider以获取train/val/测试数据集
        if args.virtual_pipeline_model_parallel_size is not None:
            all_data_iterators = [
                build_train_valid_test_data_iterators(train_valid_test_dataset_provider)
                for _ in range(len(model))
            ]
            train_data_iterator = [data_iterators[0] for data_iterators in all_data_iterators]
            valid_data_iterator = [data_iterators[1] for data_iterators in all_data_iterators]
            test_data_iterator = [data_iterators[2] for data_iterators in all_data_iterators]
        else:
            train_data_iterator, valid_data_iterator, test_data_iterator \
                = build_train_valid_test_data_iterators(
                    train_valid_test_dataset_provider)
    
        iteration = 0
        if args.do_train and args.train_iters > 0:
            iteration = train(forward_step_func, # 训练模型
                              model, optimizer, lr_scheduler,
                              train_data_iterator, valid_data_iterator)
    
        if args.do_valid:
            prefix = 'the end of training for val data'
            evaluate_and_print_results(prefix, forward_step_func,
                                       valid_data_iterator, model,
                                       iteration, False)
    
        if args.save and iteration != 0:
            save_checkpoint(iteration, model, optimizer, lr_scheduler)
    
        if args.do_test:
            # Run on test data.
            prefix = 'the end of training for test data'
            evaluate_and_print_results(prefix, forward_step_func,
                                       test_data_iterator, model,
                                       0, True)
    

    对于我们分析来说,initialize_megatron 是重点,这里初始化了 megatron。

    0x03 初始化

    3.1 initialize_megatron

    initialize_megatron 方法会设置全局变量,初始化分布式环境等等。

    def initialize_megatron(extra_args_provider=None, args_defaults={},
                            ignore_unknown_args=False, allow_no_cuda=False):
        """Set global variables, initialize distributed, and
        set autoresume and random seeds.
        `allow_no_cuda` should not be set unless using megatron for cpu only 
        data processing. In general this arg should not be set unless you know 
        what you are doing.
        Returns a function to finalize distributed env initialization 
        (optionally, only when args.lazy_mpu_init == True)
        """
        if not allow_no_cuda:
            # Make sure cuda is available.
            assert torch.cuda.is_available(), 'Megatron requires CUDA.'
    
        # Parse args, build tokenizer, and set adlr-autoresume,
        # tensorboard-writer, and timers.
        set_global_variables(extra_args_provider=extra_args_provider, # 设置全局变量
                             args_defaults=args_defaults,
                             ignore_unknown_args=ignore_unknown_args)
    
        # torch.distributed initialization
        def finish_mpu_init():
            args = get_args()
            # Pytorch distributed.
            _initialize_distributed() # 设置分布式
            
            # Random seeds for reproducibility.
            if args.rank == 0:
                print('> setting random seeds to {} ...'.format(args.seed))
            _set_random_seed(args.seed)
    
        # Set pytorch JIT layer fusion options.
        _set_jit_fusion_options()
    
        args = get_args()
        if  args.lazy_mpu_init:
            args.use_cpu_initialization=True
            # delayed initialization of DDP-related stuff
            # We only set basic DDP globals    
            set_tensor_model_parallel_world_size(args.tensor_model_parallel_size)
            # and return function for external DDP manager
            # to call when it has DDP initialized
            set_tensor_model_parallel_rank(args.rank)    
            return finish_mpu_init
        else:
            # Megatron's MPU is the master. Complete initialization right away.
            finish_mpu_init()
    
            # Autoresume.
            _init_autoresume()
    
            # Compile dependencies.
            _compile_dependencies()
    
            # No continuation function
            return None
    

    3.2 初始化分布式环境

    _initialize_distributed 代码位于 megatron/initialize.py,此方法会:

    • 调用 torch.distributed.init_process_group 初始化分布式环境。
    • 调用 mpu.initialize_model_parallel 来设置模型并行,数据并行等各种进程组,我们下文会重点讨论。

    创建完worker进程之后,程序需要知道哪些进程在训练同一个模型,torch.distributed.init_process_group 就实现了这个功能。torch.distributed.init_process_group 会生成一个进程组,同组内进程训练同一个模型,也能确定用什么方式进行通信。进程组会给组内每个进程一个序号,就是gloabl rank,如果是多机并行,每个机器创建的进程之间也有一个序号,就是 local rank。如果是单机多卡并行,local rank 和 global rank是一致的。

    def _initialize_distributed():
        """Initialize torch.distributed and mpu."""
        args = get_args()
    
        device_count = torch.cuda.device_count()
        if torch.distributed.is_initialized():
            args.rank = torch.distributed.get_rank()
            args.world_size = torch.distributed.get_world_size()
        else:
            # Manually set the device ids.
            if device_count > 0:
                device = args.rank % device_count
                if args.local_rank is not None:
                    assert args.local_rank == device, \
                        'expected local-rank to be the same as rank % device-count.'
                else:
                    args.local_rank = device
                torch.cuda.set_device(device)
        # Call the init process
        torch.distributed.init_process_group( # 初始化PyTorch分布式环境
            backend=args.distributed_backend,
            world_size=args.world_size, rank=args.rank,
            timeout=timedelta(minutes=10))
    
        # Set the tensor model-parallel, pipeline model-parallel, and
        # data-parallel communicators.
        if device_count > 0:
            if mpu.model_parallel_is_initialized():
                print('model parallel is already initialized')
            else:
    					  # 初始化模型并行,比如设置各种进程组
                mpu.initialize_model_parallel(args.tensor_model_parallel_size,
                                              args.pipeline_model_parallel_size,
                                              args.virtual_pipeline_model_parallel_size,
                                              args.pipeline_model_parallel_split_rank)
    

    3.3 初始化进程组全局变量

    因为调用了 mpu.initialize_model_parallel 来设置模型并行,数据并行等各种进程组,所以我们假定目前进程组都已经设置成功,所以每个 rank 对应的进程都有自己的全局变量。假定目前有16个GPU,属于两个node,rank 0 ~7 属于第一个节点,rank 8 ~ 15 属于第二个节点。下面的 gi 指的是第 i 个 GPU。

    • _TENSOR_MODEL_PARALLEL_GROUP :当前 rank 所属于的Intra-layer model parallel group,就是tensor 并行进程组。
      • 假如每一层分为两个tensor,则 _TENSOR_MODEL_PARALLEL_GROUP 例子为:[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]。
    • _PIPELINE_MODEL_PARALLEL_GROUP :当前 rank 所属于的Intra-layer model parallel group,就是流水线进程组。
      • 假如流水线深度为4,则例子为 [g0, g4, g8, g12], [g1, g5, g9, g13], [g2, g6, g10, g14], [g3, g7, g11, g15]。
    • _MODEL_PARALLEL_GROUP :当前 rank 所属于的模型并行进程组,包括了以上两组。
      • 针对我们例子,就是完整模型被复制了两份,两份分别对应的 GPU 具体是[0, 1, 4, 5, 8, 9, 12, 13],[2, 3, 6, 7, 10, 11, 14, 15]
    • _EMBEDDING_GROUP : 嵌入对应的进程组。
    • _DATA_PARALLEL_GROUP :当前 rank 所属于的Data parallel group。
      • 假如数据并行度数为2,则例子为[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]。
    # Intra-layer model parallel group that the current rank belongs to.
    _TENSOR_MODEL_PARALLEL_GROUP = None
    # Inter-layer model parallel group that the current rank belongs to.
    _PIPELINE_MODEL_PARALLEL_GROUP = None
    # Model parallel group (both intra- and pipeline) that the current rank belongs to.
    _MODEL_PARALLEL_GROUP = None
    # Embedding group.
    _EMBEDDING_GROUP = None
    # Data parallel group that the current rank belongs to.
    _DATA_PARALLEL_GROUP = None
    

    0x04 设置模型

    在 Pretrain 之中,会调用如下来设置模型,优化器等等。

    # Model, optimizer, and learning rate. 使用model_provider设置模型、优化器和lr计划
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider,
                                                               model_type)
    

    4.1 setup_model_and_optimizer

    setup_model_and_optimizer 方法会设置模型和优化器,其中重点是get_model。

    def setup_model_and_optimizer(model_provider_func, model_type):
        """Setup model and optimizer."""
        args = get_args()
        model = get_model(model_provider_func, model_type)
        unwrapped_model = unwrap_model(model,
                                       (torchDDP, LocalDDP, Float16Module))
        optimizer = get_megatron_optimizer(unwrapped_model)
        lr_scheduler = get_learning_rate_scheduler(optimizer)
    
        if args.load is not None:
            timers = get_timers()
            # Extra barrier is added to make sure all ranks report the
            # max time.
            torch.distributed.barrier()
            args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
            torch.distributed.barrier()
        else:
            args.iteration = 0
    
        # We only support local DDP with multiple micro-batches.
        if len(model) > 1 or mpu.get_pipeline_model_parallel_world_size() > 1:
            assert args.DDP_impl == 'local'
    
        # get model without FP16 and/or TorchDDP wrappers
        if args.iteration == 0 and len(unwrapped_model) == 1 \
            and hasattr(unwrapped_model[0], 'init_state_dict_from_bert'):
            unwrapped_model[0].init_state_dict_from_bert()
            if args.fp16:
                optimizer.reload_model_params()
    
        return model, optimizer, lr_scheduler
    

    4.2 模型

    4.2.1 BertModel

    我们首先看看 BertModel 的初始化函数,略过其他功能函数。其主要调用了 get_language_model。

    class BertModel(MegatronModule):
        """Bert Language model."""
    
        def __init__(self,
                     num_tokentypes=2,
                     add_binary_head=True,
                     parallel_output=True,
                     pre_process=True,
                     post_process=True):
            super(BertModel, self).__init__()
            args = get_args()
    
            self.fp16_lm_cross_entropy = args.fp16_lm_cross_entropy
            self.add_binary_head = add_binary_head
            self.parallel_output = parallel_output
            self.pre_process = pre_process
            self.post_process = post_process
    
            init_method = init_method_normal(args.init_method_std)
            scaled_init_method = scaled_init_method_normal(args.init_method_std,
                                                           args.num_layers)
    
    				# 获取语言模型
            self.language_model, self._language_model_key = get_language_model(
                num_tokentypes=num_tokentypes,
                add_pooler=self.add_binary_head,
                encoder_attn_mask_type=AttnMaskType.padding,
                init_method=init_method,
                scaled_init_method=scaled_init_method,
                pre_process=self.pre_process,
                post_process=self.post_process)
    
            self.initialize_word_embeddings(init_method_normal)
            if self.post_process: # 如果是最后一层,会特殊处理
                self.lm_head = BertLMHead(
                    self.word_embeddings_weight().size(0),
                    args.hidden_size, init_method, args.layernorm_epsilon, parallel_output)
                self._lm_head_key = 'lm_head'
                self.binary_head = None
                if self.add_binary_head:
                    self.binary_head = get_linear_layer(args.hidden_size, 2,
                                                        init_method)
                    self._binary_head_key = 'binary_head'
    

    4.2.2 语言模型

    get_language_model 会获取一个 TransformerLanguageModel。

    def get_language_model(num_tokentypes, add_pooler,
                           encoder_attn_mask_type, init_method=None,
                           scaled_init_method=None, add_encoder=True,
                           add_decoder=False,
                           decoder_attn_mask_type=AttnMaskType.causal,
                           pre_process=True, post_process=True):
        """Build language model and return along with the key to save."""
        args = get_args()
    
        if init_method is None:
            init_method = init_method_normal(args.init_method_std)
    
        if scaled_init_method is None:
            scaled_init_method = scaled_init_method_normal(args.init_method_std,
                                                           args.num_layers)
    
        # Language model.
        language_model = TransformerLanguageModel(
            init_method,
            scaled_init_method,
            encoder_attn_mask_type,
            num_tokentypes=num_tokentypes,
            add_encoder=add_encoder,
            add_decoder=add_decoder,
            decoder_attn_mask_type=decoder_attn_mask_type,
            add_pooler=add_pooler,
            pre_process=pre_process,
            post_process=post_process
        )
        # key used for checkpoints.
        language_model_key = 'language_model'
    
        return language_model, language_model_key
    

    TransformerLanguageModel 就是具体的语言模型,其中重要的是 ParallelTransformer。这里会依据传入的配置来进行生成。

    • 如果是第一层,即有 pre_process,则会加入 embedding layer。
    • 如果是中间层,则会根据 encoder 还是 decoder 来生成对应的 ParallelTransformer。
    • 如果是最后一层,即有 post_process,则会加入 Pooler,在外层 BertModel 也会有对应处理。
    class TransformerLanguageModel(MegatronModule):
        """Transformer language model.
    
        Arguments:
            transformer_hparams: transformer hyperparameters
            vocab_size: vocabulary size
            max_sequence_length: maximum size of sequence. This
                                 is used for positional embedding
            embedding_dropout_prob: dropout probability for embeddings
            num_tokentypes: size of the token-type embeddings. 0 value
                            will ignore this embedding
        """
    
        def __init__(self,
                     init_method,
                     output_layer_init_method,
                     encoder_attn_mask_type,
                     num_tokentypes=0,
                     add_encoder=True,
                     add_decoder=False,
                     decoder_attn_mask_type=AttnMaskType.causal,
                     add_pooler=False,
                     pre_process=True,
                     post_process=True):
            super(TransformerLanguageModel, self).__init__()
            args = get_args()
    
            self.pre_process = pre_process
            self.post_process = post_process
            self.hidden_size = args.hidden_size
            self.num_tokentypes = num_tokentypes
            self.init_method = init_method
            self.add_encoder = add_encoder
            self.encoder_attn_mask_type = encoder_attn_mask_type
            self.add_decoder = add_decoder
            self.decoder_attn_mask_type = decoder_attn_mask_type
            self.add_pooler = add_pooler
            self.encoder_hidden_state = None
    
            # Embeddings.
            if self.pre_process:
                self.embedding = Embedding(self.hidden_size,
                                           args.padded_vocab_size,
                                           args.max_position_embeddings,
                                           args.hidden_dropout,
                                           self.init_method,
                                           self.num_tokentypes)
                self._embedding_key = 'embedding'
    
            # Transformer.
            # Encoder (usually set to True, False if part of an encoder-decoder
            # architecture and in encoder-only stage).
            if self.add_encoder:
                self.encoder = ParallelTransformer(
                    self.init_method,
                    output_layer_init_method,
                    self_attn_mask_type=self.encoder_attn_mask_type,
                    pre_process=self.pre_process,
                    post_process=self.post_process
                )
                self._encoder_key = 'encoder'
            else:
                self.encoder = None
    
            # Decoder (usually set to False, True if part of an encoder-decoder
            # architecture and in decoder-only stage).
            if self.add_decoder:
                # Temporary assertion until we verify correctness of pipeline parallelism
                # implementation of T5.
                self.decoder = ParallelTransformer(
                    self.init_method,
                    output_layer_init_method,
                    layer_type=LayerType.decoder,
                    self_attn_mask_type=self.decoder_attn_mask_type,
                    pre_process=self.pre_process,
                    post_process=self.post_process)
                self._decoder_key = 'decoder'
            else:
                self.decoder = None
    
            if self.post_process:
                # Pooler.
                if self.add_pooler:
                    self.pooler = Pooler(self.hidden_size, self.init_method)
                    self._pooler_key = 'pooler'
    

    4.2.3 ParallelTransformer

    这里会调用 ParallelTransformerLayer 生成具体的 Transformer层,我们会在后文中进行分析。

    即,ParallelTransformer 包括多个 Transformer,其中每层 Transformer 是一个 ParallelTransformerLayer

    class ParallelTransformer(MegatronModule):
        """Transformer class."""
    
        def __init__(self, init_method, output_layer_init_method,
                     layer_type=LayerType.encoder,
                     self_attn_mask_type=AttnMaskType.padding,
                     pre_process=True, post_process=True):
            super(ParallelTransformer, self).__init__()
            args = get_args()
    
            self.bf16 = args.bf16
            self.fp32_residual_connection = args.fp32_residual_connection
            self.pre_process = pre_process
            self.post_process = post_process
            self.input_tensor = None
    
            # Store activation checkpoiting flag.
            self.activations_checkpoint_method = args.activations_checkpoint_method
            self.activations_checkpoint_num_layers = args.activations_checkpoint_num_layers
            self.distribute_checkpointed_activations = args.distribute_checkpointed_activations
    
            # Number of layers.
            self.num_layers = mpu.get_num_layers( # 获得本Transformer的具体层数
                args, args.model_type == ModelType.encoder_and_decoder)
    
            # Transformer layers.
            def build_layer(layer_number):
                return ParallelTransformerLayer( # 返回一层 Transformmer
                    init_method,
                    output_layer_init_method,
                    layer_number,
                    layer_type=layer_type,
                    self_attn_mask_type=self_attn_mask_type)
            if args.virtual_pipeline_model_parallel_size is not None:
                # Number of layers in each model chunk is the number of layers in the stage,
                # divided by the number of model chunks in a stage.
                self.num_layers = self.num_layers // args.virtual_pipeline_model_parallel_size
                # With 8 layers, 2 stages, and 4 model chunks, we want an assignment of
                # layers to stages like (each list is a model chunk):
                # Stage 0: [0]  [2]  [4]  [6]
                # Stage 1: [1]  [3]  [5]  [7]
                # With 8 layers, 2 stages, and 2 virtual stages, we want an assignment of
                # layers to stages like (each list is a model chunk):
                # Stage 0: [0, 1]  [4, 5]
                # Stage 1: [2, 3]  [6, 7]
                offset = mpu.get_virtual_pipeline_model_parallel_rank() * (
                    args.num_layers // args.virtual_pipeline_model_parallel_size) + \
                    (mpu.get_pipeline_model_parallel_rank() * self.num_layers)
            else:
                # Each stage gets a contiguous set of layers.
                offset = mpu.get_pipeline_model_parallel_rank() * self.num_layers
    
            self.layers = torch.nn.ModuleList( # 生成 num_layers 个 Transformer
                [build_layer(i + 1 + offset) for i in range(self.num_layers)])
    
            if self.post_process:
                # Final layer norm before output.
                self.final_layernorm = LayerNorm(
                    args.hidden_size,
                    eps=args.layernorm_epsilon,
                    no_persist_layer_norm=args.no_persist_layer_norm)
    

    目前逻辑如下,我们假定有两个 transformer:

    4.2.3.1 获取层数

    这里一个重点就是获取层数,即获取本模型在并行处理状况下,应该拥有多少层。如果模型一共64层,流水线深度为16,则并行每个阶段有4层,则本子模型拥有4层。

    def get_num_layers(args, is_encoder_and_decoder_model):
        """Compute the number of transformer layers resident on the current rank."""
        if get_pipeline_model_parallel_world_size() > 1:
            if is_encoder_and_decoder_model:
                assert args.pipeline_model_parallel_split_rank is not None
                num_ranks_in_encoder = args.pipeline_model_parallel_split_rank
                num_ranks_in_decoder = get_pipeline_model_parallel_world_size() - num_ranks_in_encoder
                if is_pipeline_stage_before_split():
                    num_layers = args.num_layers // num_ranks_in_encoder
                else:
                    num_layers = args.num_layers // num_ranks_in_decoder
            else:
                num_layers = args.num_layers // get_pipeline_model_parallel_world_size()
        else:
            num_layers = args.num_layers
        return num_layers
    

    get_pipeline_model_parallel_world_size 获取本流水线组world size数目,就是流水线深度。

    def get_pipeline_model_parallel_world_size():
        """Return world size for the pipeline model parallel group."""
        global _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
        if _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE is not None:
            return _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE
        return torch.distributed.get_world_size(group=get_pipeline_model_parallel_group())
    

    _MPU_PIPELINE_MODEL_PARALLEL_WORLD_SIZE 的意思是流水线深度 p,就是纵向切 p-1刀。比如一共 12 层,纵向切 5 刀,则有 6 个stage,每个 stage 有 2 层。

    4.2.3.2 前向传播

    我们接着看看其前向传播函数,这里主要就是调用内部 ParallelTransformerLayer 的 forward 方法,如果是第一层或者最后一层,则做特殊处理。

    def forward(self, hidden_states, attention_mask,
                encoder_output=None, enc_dec_attn_mask=None,
                inference_params=None):
    
        if self.pre_process:
            # Data format change to avoid explicit tranposes : [b s h] --> [s b h].
            # If the input flag for fp32 residual connection is set, convert for float.
            if self.fp32_residual_connection:
                hidden_states = hidden_states.transpose(0, 1).contiguous().float()
            # Otherwise, leave it as is.
            else:
                hidden_states = hidden_states.transpose(0, 1).contiguous()
        else:
            # See set_input_tensor()
            hidden_states = self.input_tensor
    
        if encoder_output is not None:
             encoder_output = encoder_output.transpose(0, 1).contiguous()
    
        if self.activations_checkpoint_method is not None:
            hidden_states = self._checkpointed_forward(hidden_states,
                                                       attention_mask,
                                                       encoder_output,
                                                       enc_dec_attn_mask)
        else:
            for index in range(self.num_layers):
                layer = self._get_layer(index)
                hidden_states = layer( # 调用ParallelTransformerLayer的forward函数
                    hidden_states,
                    attention_mask,
                    encoder_output=encoder_output,
                    enc_dec_attn_mask=enc_dec_attn_mask,
                    inference_params=inference_params)
    
    
        # Final layer norm.
        if self.post_process:
            # Reverting data format change [s b h] --> [b s h].
            hidden_states = hidden_states.transpose(0, 1).contiguous()
            output = self.final_layernorm(hidden_states)
        else:
            output = hidden_states
        
        return output
    

    4.3 get_model

    现在让我们回到 get_model,把生成模型的流程整理出来。

    BERT之中含有多个transformer,所以直接按照层数切分,每一层是一模一样的transformer layer。前面提到了,在我们样例之中启动了8个进程,每个进程里面有一个子模型,即原始BERT模型的部分层。但是怎么知道每个子模型包含了多少层?答案是:因为已经建立了各种进程组,所以 get_model 方法会依据目前进程组情况进行处理。单个进程内模型获取如下:

    • 如果是有 virtual 设置,则会遍历 virtual size,生成对应数目的模型(BertModel)。
    • 否则如果是 encoder_and_decoder,则针对split进行配置。
    • 设置 tensor model parallel 属性。
    • 把本模型放置到GPU之上。
    • 如果需要数据并行,则配置DDP。

    具体代码如下:

    def get_model(model_provider_func, model_type=ModelType.encoder_or_decoder, wrap_with_ddp=True):
        """Build the model."""
        args = get_args()
        args.model_type = model_type
    
        # Build model.
        if mpu.get_pipeline_model_parallel_world_size() > 1 and \
           args.virtual_pipeline_model_parallel_size is not None: # 有virtual设置,后续会提到
            model = []
            for i in range(args.virtual_pipeline_model_parallel_size): # 遍历virtual
              	# 设置rank,主要是为了看是不是第一层,最后一层
                mpu.set_virtual_pipeline_model_parallel_rank(i) 
                # Set pre_process and post_process only after virtual rank is set.
                pre_process = mpu.is_pipeline_first_stage()
                post_process = mpu.is_pipeline_last_stage()
                this_model = model_provider_func( # 获取原始模型 BertModel
                    pre_process=pre_process,
                    post_process=post_process
                )
                this_model.model_type = model_type
                model.append(this_model) # 模型列表之中添加一个新的 BertModel
        else:
            pre_process = mpu.is_pipeline_first_stage() # 是不是第一层
            post_process = mpu.is_pipeline_last_stage() # 是不是最后一层
            add_encoder = True
            add_decoder = True
            if model_type == ModelType.encoder_and_decoder:
                if mpu.get_pipeline_model_parallel_world_size() > 1:
                    rank = mpu.get_pipeline_model_parallel_rank()
                    split_rank = args.pipeline_model_parallel_split_rank
                    world_size = mpu.get_pipeline_model_parallel_world_size()
                    pre_process = rank == 0 or rank == split_rank  # 是不是第一层
                    post_process = (rank == (split_rank - 1)) or ( # 是不是最后一层
                            rank == (world_size - 1))
                    add_encoder = mpu.is_pipeline_stage_before_split()
                    add_decoder = mpu.is_pipeline_stage_after_split()
                model = model_provider_func( # 获取原始模型
                    pre_process=pre_process,
                    post_process=post_process,
                    add_encoder=add_encoder,
                    add_decoder=add_decoder)
            else:
                model = model_provider_func( # 获取原始模型
                    pre_process=pre_process,
                    post_process=post_process
                )
            model.model_type = model_type
    
        if not isinstance(model, list):
            model = [model]
    
        # Set tensor model parallel attributes if not set.
        # Only parameters that are already tensor model parallel have these
        # attributes set for them. We should make sure the default attributes
        # are set for all params so the optimizer can use them.
        for model_module in model:
            for param in model_module.parameters():
                mpu.set_defaults_if_not_set_tensor_model_parallel_attributes(param)
    
        # GPU allocation.
        for model_module in model: # 把本模型放置到GPU之上
            model_module.cuda(torch.cuda.current_device())
    
        # Fp16 conversion.
        if args.fp16 or args.bf16:
            model = [Float16Module(model_module, args) for model_module in model]
    
        if wrap_with_ddp: # 如果需要数据并行,则配置DDP
            if args.DDP_impl == 'torch':
                i = torch.cuda.current_device()
                model = [torchDDP(model_module, device_ids=[i], output_device=i,
                                  process_group=mpu.get_data_parallel_group())
                         for model_module in model]
    
            elif args.DDP_impl == 'local':
                model = [LocalDDP(model_module,
                                  args.accumulate_allreduce_grads_in_fp32,
                                  args.use_contiguous_buffers_in_local_ddp)
                         for model_module in model]
    
            else:
                raise NotImplementedError('Unknown DDP implementation specified: '
                                          '{}. Exiting.'.format(args.DDP_impl))
    
        return model
    

    单个进程内的逻辑大致如下,这里 torchDDP 的意思是把 BertModel 之中的 module 用 torchDDP 来封装。

    0x05 数据并行

    5.1 设置数据

    build_train_valid_test_data_iterators 方法会对数据进行处理,提供了 train,valid,test 三种不同的数据集。

    def build_train_valid_test_data_iterators(
            build_train_valid_test_datasets_provider):
        """XXX"""
        args = get_args()
        (train_dataloader, valid_dataloader, test_dataloader) = (None, None, None)
    
    
        # Backward compatibility, assume fixed batch size.
        if args.iteration > 0 and args.consumed_train_samples == 0:
            args.consumed_train_samples = args.iteration * args.global_batch_size
        if args.iteration > 0 and args.consumed_valid_samples == 0:
            if args.train_samples is None:
                args.consumed_valid_samples = (args.iteration // args.eval_interval) * \
                    args.eval_iters * args.global_batch_size
    
        # Data loader only on rank 0 of each model parallel group.
        if mpu.get_tensor_model_parallel_rank() == 0:
    
            # Number of train/valid/test samples.
            if args.train_samples:
                train_samples = args.train_samples
            else:
                train_samples = args.train_iters * args.global_batch_size
            eval_iters = (args.train_iters // args.eval_interval + 1) * \
                         args.eval_iters
            test_iters = args.eval_iters
            train_val_test_num_samples = [train_samples,
                                          eval_iters * args.global_batch_size,
                                          test_iters * args.global_batch_size]
    
            # Build the datasets.
            train_ds, valid_ds, test_ds = build_train_valid_test_datasets_provider(
                train_val_test_num_samples)
    
            # Build dataloders.
            train_dataloader = build_pretraining_data_loader(
                train_ds, args.consumed_train_samples)
            valid_dataloader = build_pretraining_data_loader(
                valid_ds, args.consumed_valid_samples)
            test_dataloader = build_pretraining_data_loader(test_ds, 0)
    
            # Flags to know if we need to do training/validation/testing.
            do_train = train_dataloader is not None and args.train_iters > 0
            do_valid = valid_dataloader is not None and args.eval_iters > 0
            do_test = test_dataloader is not None and args.eval_iters > 0
            # Need to broadcast num_tokens and num_type_tokens.
            flags = torch.cuda.LongTensor(
                [int(do_train), int(do_valid), int(do_test)])
        else:
            flags = torch.cuda.LongTensor([0, 0, 0])
    
        # Broadcast num tokens.
        torch.distributed.broadcast(flags,
                                    mpu.get_tensor_model_parallel_src_rank(),
                                    group=mpu.get_tensor_model_parallel_group())
        args.do_train = flags[0].item()
        args.do_valid = flags[1].item()
        args.do_test = flags[2].item()
    
        # Build iterators.
        dl_type = args.dataloader_type
    
        if train_dataloader is not None:
            train_data_iterator = iter(train_dataloader) if dl_type == 'single' \
                                  else iter(cyclic_iter(train_dataloader))
        else:
            train_data_iterator = None
    
        if valid_dataloader is not None:
            valid_data_iterator = iter(valid_dataloader) if dl_type == 'single' \
                                  else iter(cyclic_iter(valid_dataloader))
        else:
            valid_data_iterator = None
    
        if test_dataloader is not None:
            test_data_iterator = iter(test_dataloader) if dl_type == 'single' \
                                 else iter(cyclic_iter(test_dataloader))
        else:
            test_data_iterator = None
    
        return train_data_iterator, valid_data_iterator, test_data_iterator
    

    5.2 DDP

    在 get_model 之中,有如下代码使用 DDP。

    from megatron.model import DistributedDataParallel as LocalDDP
    from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
    
    if wrap_with_ddp:
        if args.DDP_impl == 'torch':
            i = torch.cuda.current_device()
            model = [torchDDP(model_module, device_ids=[i], output_device=i,
                              process_group=mpu.get_data_parallel_group())
                     for model_module in model]
    
        elif args.DDP_impl == 'local':
            model = [LocalDDP(model_module,
                              args.accumulate_allreduce_grads_in_fp32,
                              args.use_contiguous_buffers_in_local_ddp)
                     for model_module in model]
    
        else:
            raise NotImplementedError('Unknown DDP implementation specified: '
                                      '{}. Exiting.'.format(args.DDP_impl))
    

    所以我们看看 megatron 自己的 DDP实现。

    5.2.1 定义

    定义只有注释可以看看,使用连续的(contiguous)内存来存储和累积梯度,每一种类型的张量属于一个统一的内存,可以统一做 allreduce。

    class DistributedDataParallel(DistributedDataParallelBase):
        """DDP with contiguous buffers options to storre and accumulate gradients.
        This class:
            - has the potential to reduce memory fragmentation.
            - provides the option to do the gradient accumulation
              in a type other than the params type (for example fp32)
    
        Arguments:
            module: input model.
            accumulate_allreduce_grads_in_fp32: if true do the gradient accumulation
                and the gradient all-reduce all in in float32. If this option is
                true, we require `use_contiguous_buffers` to be true too.
            use_contiguous_buffers: if true, use a contiguous buffer to store the
                gradients.
        """
    

    5.2.2 初始化

    初始化方法的目的是把同类型梯度连续存储。

    def __init__(self, module,
                 accumulate_allreduce_grads_in_fp32,
                 use_contiguous_buffers):
    
        super(DistributedDataParallel, self).__init__(module)
    
        self.accumulate_allreduce_grads_in_fp32 \
            = accumulate_allreduce_grads_in_fp32
        self.use_contiguous_buffers = use_contiguous_buffers
        # If we are using fp32-accumulate-allreduce explicitly
        # this means we need main grads in a continous buffer.
        if self.accumulate_allreduce_grads_in_fp32:
            assert self.use_contiguous_buffers
    
        # ===================================
        # Rest of this part applies only to
        # the case we use continuous buffers.
        # ===================================
        self._grad_buffers = None
        if self.use_contiguous_buffers: # 这里只考虑连续内存
            self._grad_buffers = {} # 定义buffer
    
            # Simple function to define buffer type.
            def _get_buffer_type(param): # 返回buffer类型
                return torch.float if \
                    self.accumulate_allreduce_grads_in_fp32 else param.dtype
    
            # First calculate total number of elements per type.
            type_num_elements = {}
            for param in self.module.parameters(): # 遍历模型参数
                if param.requires_grad: # 如果需要计算梯度
                    dtype = _get_buffer_type(param) # 获取参数类型
                    type_num_elements[dtype] = type_num_elements.get(dtype, 0) \
                                               + param.data.nelement() # 该类型参数数目做相应增加
    
            # 目前 type_num_elements 是各种类型参数的个数          
            # Allocate the buffer.
            for dtype, num_elements in type_num_elements.items(): # 遍历各种类型
                self._grad_buffers[dtype] = MemoryBuffer(num_elements, dtype) # 分配内存
    
            # 这里是假定反向传播是参数的反方向,存储每个参数梯度的起始位置    
            # Assume the back prop order is reverse the params order, 
            # store the start index for the gradients.
            for param in self.module.parameters(): # 遍历模型参数
                if param.requires_grad: # 如果需要计算梯度
                    dtype = _get_buffer_type(param) # 获取参数类型
                    type_num_elements[dtype] -= param.data.nelement() # 减少size
                    # 确定该参数在MemoryBuffer的位置
                    param.main_grad = self._grad_buffers[dtype].get( # 获取该参数对应的内存
                        param.data.shape, type_num_elements[dtype])
    
            # Backward hook.
            # Accumalation function for the gradients. We need
            # to store them so they don't go out of scope.
            self.grad_accs = []
            # Loop over all the parameters in the model.
            for param in self.module.parameters(): # 遍历模型参数
                if param.requires_grad: # 如果需要计算梯度
                    # Expand so we get access to grad_fn.
                    param_tmp = param.expand_as(param)
                    # Get the gradient accumulator functtion.
                    grad_acc = param_tmp.grad_fn.next_functions[0][0] # 得到参数对应的梯度函数
                    grad_acc.register_hook(self._make_param_hook(param)) # 注册了hook
                    self.grad_accs.append(grad_acc) # 统一管理梯度函数,其实就是book keeping作用
    

    5.2.3 内存

    MemoryBuffer 是内存抽象。

    class MemoryBuffer:
    
        def __init__(self, numel, dtype):
            self.numel = numel
            self.dtype = dtype
            self.data = torch.zeros(self.numel, # 初始化内存
                                    dtype=self.dtype,
                                    device=torch.cuda.current_device(),
                                    requires_grad=False)
    
    
        def zero(self):
            """Reset the buffer to zero."""
            self.data.zero_()
    
    
        def get(self, shape, start_index):
            """Return a tensor with the input `shape` as a view into the
            1-D data starting at `start_index`."""
            end_index = start_index + shape.numel() # 定位到该张量在内存buffer之中的位置
            assert end_index <= self.numel, \
                'requested tensor is out of the buffer range.'
            buffer_tensor = self.data[start_index:end_index] # 拿到内存
            buffer_tensor = buffer_tensor.view(shape)
            return buffer_tensor # 
    

    5.2.4 支撑函数

    下面是两个支撑函数,分别是用于拷贝梯度和将buffer清零。

    def _make_param_hook(self, param):
        """Create the all-reduce hook for backprop."""
        # Hook used for back-prop.
        def param_hook(*unused):
            # Add the gradient to the buffer.
            if param.grad.data is not None:
                param.main_grad.add_(param.grad.data) # 把梯度拷贝到连续内存之中
                # Now we can deallocate grad memory.
                param.grad = None
        return param_hook
    
    def zero_grad_buffer(self):
        """Set the grad buffer data to zero. Needs to be called at the
        begining of each iteration."""
        assert self._grad_buffers is not None, 'buffers are not initialized.'
        for _, buffer_ in self._grad_buffers.items():
            buffer_.zero()
    

    我们假定模型有6个参数,3个 fp32,3 个 fp16,所以被组合成两个连续内存 MemoryBuffer。

    5.2.5 梯度规约

    allreduce_gradients 是 DDP 对外提供的 API,在后面 train step 之中会调用到。

    def allreduce_gradients(self):
        """Reduce gradients across data parallel ranks."""
        # If we have buffers, simply reduce the data in the buffer.
        if self._grad_buffers is not None:
            # 连续内存
            for _, buffer_ in self._grad_buffers.items():  # 遍历各种类型的buffer
                buffer_.data /= mpu.get_data_parallel_world_size()
                torch.distributed.all_reduce( # 统一归并
                    buffer_.data, group=mpu.get_data_parallel_group())
        else:
            # Otherwise, bucketize and all-reduce
            buckets = {} # 否则还是用桶来归并
            # Pack the buckets.
            for param in self.module.parameters(): # 遍历梯度
                if param.requires_grad and param.grad is not None:
                    tp = param.data.type()
                    if tp not in buckets:
                        buckets[tp] = []
                    buckets[tp].append(param) # 同类型的梯度放到对应类型的桶之中
                    param.main_grad = param.grad
    
            # For each bucket, all-reduce and copy all-reduced grads.
            for tp in buckets:
                bucket = buckets[tp]
                grads = [param.grad.data for param in bucket] # 把桶里的梯度拿出来
                coalesced = _flatten_dense_tensors(grads) # 打平梯度
                coalesced /= mpu.get_data_parallel_world_size()
                torch.distributed.all_reduce( # 归并
                    coalesced, group=mpu.get_data_parallel_group())
                for buf, synced in zip(grads, _unflatten_dense_tensors(
                        coalesced, grads)):
                    buf.copy_(synced)
    

    运行时候,分别对两种类型的连续内存做 AllReduce。

    0x06 训练

    Pretrain 之中会调用 train 来进行训练。

    if args.do_train and args.train_iters > 0:
        iteration = train(forward_step_func,
                          model, optimizer, lr_scheduler,
                          train_data_iterator, valid_data_iterator)
    

    6.1 训练主体

    train 是常规的套路,大家基本上按照名字就可以理解。

    def train(forward_step_func, model, optimizer, lr_scheduler,
              train_data_iterator, valid_data_iterator):
        """Train the model function."""
        args = get_args()
        timers = get_timers()
    
        # Write args to tensorboard
        write_args_to_tensorboard()
    
        # Turn on training mode which enables dropout.
        for model_module in model:
            model_module.train() # 
    
        # Tracking loss.
        total_loss_dict = {}
    
        # Iterations.
        iteration = args.iteration
    
        report_memory_flag = True
        while iteration < args.train_iters:
            update_num_microbatches(args.consumed_train_samples)
            loss_dict, skipped_iter, grad_norm, num_zeros_in_grad = \
                train_step(forward_step_func, # 训练
                           train_data_iterator,
                           model,
                           optimizer,
                           lr_scheduler)
            iteration += 1
            args.consumed_train_samples += mpu.get_data_parallel_world_size() * \
                                           args.micro_batch_size * \
                                           get_num_microbatches()
    
            # Logging.
            loss_scale = optimizer.get_loss_scale().item()
            params_norm = None
            if args.log_params_norm:
                params_norm = calc_params_l2_norm(model)
            report_memory_flag = training_log(loss_dict, total_loss_dict,
                                              optimizer.param_groups[0]['lr'],
                                              iteration, loss_scale,
                                              report_memory_flag, skipped_iter,
                                              grad_norm, params_norm, num_zeros_in_grad)
    
            # Autoresume
            if args.adlr_autoresume and \
               (iteration % args.adlr_autoresume_interval == 0):
                check_adlr_autoresume_termination(iteration, model, optimizer,
                                                  lr_scheduler)
    
            # Evaluation
            if args.eval_interval and iteration % args.eval_interval == 0 and \
               args.do_valid:
                prefix = 'iteration {}'.format(iteration)
                evaluate_and_print_results(prefix, forward_step_func,
                                           valid_data_iterator, model,
                                           iteration, False)
    
            # Checkpointing
            saved_checkpoint = False
            if args.exit_signal_handler:
                signal_handler = get_signal_handler()
                if any(signal_handler.signals_received()):
                    save_checkpoint_and_time(iteration, model, optimizer,
                                             lr_scheduler)
                    sys.exit()
    
            if args.save and args.save_interval and \
               iteration % args.save_interval == 0:
                save_checkpoint_and_time(iteration, model, optimizer,
                                         lr_scheduler)
                saved_checkpoint = True
    
            # Exiting based on duration
            if args.exit_duration_in_mins:
                train_time = (time.time() - _TRAIN_START_TIME) / 60.0
                done_cuda = torch.cuda.IntTensor(
                    [train_time > args.exit_duration_in_mins])
                torch.distributed.all_reduce(
                    done_cuda, op=torch.distributed.ReduceOp.MAX)
                done = done_cuda.item()
                if done:
                    if not saved_checkpoint:
                        save_checkpoint_and_time(iteration, model, optimizer,
                                                 lr_scheduler)
                    sys.exit()
    
            # Exiting based on iterations
            if args.exit_interval and iteration % args.exit_interval == 0:
                if not saved_checkpoint:
                    save_checkpoint_and_time(iteration, model, optimizer,
                                             lr_scheduler)
                torch.distributed.barrier()
                sys.exit()
    
        return iteration
    

    6.2 训练step

    train_step 会获取 get_forward_backward_func 得到 schedule,因为是流水线并行,所以需要 schedule 如何具体训练。

    def train_step(forward_step_func, data_iterator,
                   model, optimizer, lr_scheduler):
        """Single training step."""
        args = get_args()
        timers = get_timers()
    
        # Set grad to zero.
        if args.DDP_impl == 'local' and args.use_contiguous_buffers_in_local_ddp:
            for partition in model:
                partition.zero_grad_buffer()
        optimizer.zero_grad()
    
        # 获取训练schedule
        forward_backward_func = get_forward_backward_func()
        losses_reduced = forward_backward_func( # 进行训练
            forward_step_func, data_iterator, model,
            optimizer, timers, forward_only=False)
    
        # Empty unused memory
        if args.empty_unused_memory_level >= 1:
            torch.cuda.empty_cache()
    
        # All-reduce if needed.
        if args.DDP_impl == 'local':
            for model_module in model:
                model_module.allreduce_gradients()
    
        # All-reduce word_embeddings' grad across first and last stages to ensure
        # that word_embeddings parameters stay in sync.
        # This should only run for models that support pipelined model parallelism
        # (BERT and GPT-2).
        if mpu.is_rank_in_embedding_group(ignore_virtual=True) and \
                mpu.get_pipeline_model_parallel_world_size() > 1:
            if mpu.is_pipeline_first_stage(ignore_virtual=True):
                unwrapped_model = model[0]
            elif mpu.is_pipeline_last_stage(ignore_virtual=True):
                unwrapped_model = model[-1]
            else:  # We do not support the interleaved schedule for T5 yet.
                unwrapped_model = model[0]
            unwrapped_model = unwrap_model(
                unwrapped_model, (torchDDP, LocalDDP, Float16Module))
    
            if unwrapped_model.share_word_embeddings:
                word_embeddings_weight = unwrapped_model.word_embeddings_weight()
                if args.DDP_impl == 'local':
                    grad = word_embeddings_weight.main_grad
                else:
                    grad = word_embeddings_weight.grad
                torch.distributed.all_reduce(grad, group=mpu.get_embedding_group())
    
        # Update parameters.
        update_successful, grad_norm, num_zeros_in_grad = optimizer.step()
    
        # Update learning rate.
        if update_successful:
            increment = get_num_microbatches() * \
                        args.micro_batch_size * \
                        args.data_parallel_size
            lr_scheduler.step(increment=increment)
            skipped_iter = 0
        else:
            skipped_iter = 1
    
        # Empty unused memory
        if args.empty_unused_memory_level >= 2:
            torch.cuda.empty_cache()
    
        if mpu.is_pipeline_last_stage(ignore_virtual=True):
            # Average loss across microbatches.
            loss_reduced = {}
            for key in losses_reduced[0]:
                losses_reduced_for_key = [x[key] for x in losses_reduced]
                loss_reduced[key] = sum(losses_reduced_for_key) / len(losses_reduced_for_key)
            return loss_reduced, skipped_iter, grad_norm, num_zeros_in_grad
        return {}, skipped_iter, grad_norm, num_zeros_in_grad
    

    6.3 获取schedule

    get_forward_backward_func 获取 pipeline 的schedule,这里分为 flush 和 interleaving 两种,我们后续会分析这两种schedule。

    def get_forward_backward_func():
        args = get_args()
        if mpu.get_pipeline_model_parallel_world_size() > 1:
            if args.virtual_pipeline_model_parallel_size is not None:
                forward_backward_func = forward_backward_pipelining_with_interleaving
            else:
                forward_backward_func = forward_backward_pipelining_without_interleaving
        else:
            forward_backward_func = forward_backward_no_pipelining
        return forward_backward_func
    

    训练逻辑大体拓展为:

    至此,Megatron 基本架构分析完毕,下一篇我们介绍模型并行设置。

    0xFF 参考

    [细读经典]Megatron论文和代码详细分析(2)

    [细读经典]Megatron论文和代码详细分析(1)

    Megatron-LM源码阅读(一)

    Megatron-LM源码阅读(二)

    megatron学习总结

    GTC 2020: Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    www.DeepL.com/Translator

    https://developer.nvidia.com/gtc/2020/slides/s21496-megatron-lm-training-multi-billion-parameter-language-models-using-model-parallelism.pdf

    NVIDIA解决方案架构师深度解析大规模参数语言模型Megatron-BERT

  • 相关阅读:
    web开发(六) EL表达式
    web开发(五) JSP详解(四大作用域九大内置对象等)
    web开发(四) 一次性验证码的代码实现
    Netty4
    Android Fragment
    Android 6.0 双向通话自动录音
    安卓
    SpringMVC + Spring + Mybatis+ Redis +shiro以及MyBatis学习
    Spring 3 AOP 概念及完整示例
    Java并发之CountDownLatch、CyclicBarrier和Semaphore
  • 原文地址:https://www.cnblogs.com/rossiXYZ/p/15868988.html
Copyright © 2020-2023  润新知