#1 pangu_dialog_tune.sh fine-tuning fails with a "size mismatch for weight" error

Closed
created 1 year ago by qinghao · 1 comment
qinghao commented 1 year ago
root@17b3a541b30c:/workspace# /workspace/pangu-alpha-applications/app/chat/scripts/pangu_dialog_tune.sh
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
/workspace/pangu-alpha-applications /workspace/pangu-alpha-applications
using world size: 2 and model-parallel size: 2
using torch.float16 for parameters ...
WARNING: overriding default arguments for tokenizer_type:GPT2BPETokenizer with tokenizer_type:GPT2BPETokenizer
-------------------- arguments --------------------
  adlr_autoresume ................. False
  adlr_autoresume_interval ........ 1000
  apply_query_key_layer_scaling ... False
  apply_residual_connection_post_layernorm False
  attention_dropout ............... 0.1
  attention_softmax_in_fp32 ....... False
  batch_size ...................... 64
  bert_load ....................... None
  bias_dropout_fusion ............. False
  bias_gelu_fusion ................ False
  block_data_path ................. None
  checkpoint_activations .......... True
  checkpoint_num_layers ........... 1
  clip_grad ....................... 1.0
  data_impl ....................... mmap
  data_path ....................... /workspace/finetune/data/
  DDP_impl ........................ torch
  distribute_checkpointed_activations True
  distributed_backend ............. nccl
  dynamic_loss_scale .............. True
  eod_mask_loss ................... False
  eval_interval ................... 500
  eval_iters ...................... 200
  exit_interval ................... None
  faiss_use_gpu ................... False
  finetune ........................ True
  fp16 ............................ True
  fp16_lm_cross_entropy ........... True
  fp32_allreduce .................. False
  gradient_accumulation_steps ..... 1
  hidden_dropout .................. 0.1
  hidden_size ..................... 2560
  hysteresis ...................... 2
  ict_head_size ................... None
  ict_load ........................ None
  indexer_batch_size .............. 128
  indexer_log_interval ............ 1000
  init_method_std ................. 0.02
  layernorm_epsilon ............... 1e-05
  lazy_mpu_init ................... None
  load ............................ /workspace/pangu_dialog_fp16_2b6/
  local_rank ...................... 0
  log_interval .................... 100
  loss_scale ...................... None
  loss_scale_window ............... 1000
  lr .............................. 5e-05
  lr_decay_iters .................. 1000
  lr_decay_style .................. cosine
  make_vocab_size_divisible_by .... 1
  mask_prob ....................... 0.15
  max_position_embeddings ......... 1024
  merge_file ...................... gpt2-merges.txt
  min_lr .......................... 1e-06
  min_scale ....................... 1
  mmap_warmup ..................... False
  model_parallel_size ............. 2
  no_load_optim ................... False
  no_load_rng ..................... True
  no_save_optim ................... False
  no_save_rng ..................... False
  num_attention_heads ............. 32
  num_layers ...................... 31
  num_unique_layers ............... None
  num_workers ..................... 2
  onnx_safe ....................... None
  openai_gelu ..................... False
  override_lr_scheduler ........... False
  param_sharing_style ............. grouped
  params_dtype .................... torch.float16
  query_in_block_prob ............. 0.1
  rank ............................ 0
  report_topk_accuracies .......... []
  reset_attention_mask ............ False
  reset_position_ids .............. False
  save ............................ /workspace/finetune/model/pangu_dialog_fp16_2b6_new/
  save_interval ................... 10000
  scaled_upper_triang_masked_softmax_fusion False
  seed ............................ 1234
  seq_length ...................... 1024
  short_seq_prob .................. 0.1
  split ........................... 949,50,1
  tensorboard_dir ................. None
  titles_data_path ................ None
  tokenizer_type .................. GPT2BPETokenizer
  train_iters ..................... 50000
  use_checkpoint_lr_scheduler ..... False
  use_cpu_initialization .......... True
  use_one_sent_docs ............... False
  vocab_file ...................... /workspace/pangu-alpha-applications/megatron/bpe_4w_pcl/vocab
  warmup .......................... 0.01
  weight_decay .................... 0.01
  world_size ...................... 2
---------------- end of arguments ----------------
> building GPT2BPETokenizer tokenizer ...
> padded vocab (size: 40000) with 0 dummy tokens (new size: 40000)
> initializing torch distributed ...
ParseResult(scheme='tcp', netloc='localhost:6000', path='', params='', query='rank=0&world_size=2', fragment='')
ParseResult(scheme='tcp', netloc='localhost:6000', path='', params='', query='rank=1&world_size=2', fragment='')
> initializing model parallel with size 2
> setting random seeds to 1234 ...
> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
> building the checkpointed activations memory buffer with 2600468480 num elements and torch.float16 dtype (4960.0 MB)...
building PT model ...
> number of parameters on model parallel rank 1: 1315517440
> number of parameters on model parallel rank 0: 1315517440
global rank 1 is loading checkpoint /workspace/pangu_dialog_fp16_2b6/iter_0049705/mp_rank_01/model_optim_rng.pt
> learning rate decay style: cosine
global rank 0 is loading checkpoint /workspace/pangu_dialog_fp16_2b6/iter_0049705/mp_rank_00/model_optim_rng.pt
could not find arguments in the checkpoint ...
Traceback (most recent call last):
  File "/workspace/pangu-alpha-applications/app/chat/train/training_main.py", line 107, in <module>
    prompt_tune_train(train_valid_test_datasets_provider,
  File "/workspace/pangu-alpha-applications/method/prompttune/tasks/prompt_tune.py", line 69, in prompt_tune_train
    model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
  File "/workspace/pangu-alpha-applications/method/prompttune/tasks/prompt_tune.py", line 218, in setup_model_and_optimizer
    args.iteration = load_checkpoint(model, optimizer, lr_scheduler)
  File "/workspace/pangu-alpha-applications/megatron/checkpointing.py", line 210, in load_checkpoint
    model.load_state_dict(state_dict['model'])
  File "/workspace/pangu-alpha-applications/megatron/fp16/fp16.py", line 85, in load_state_dict
    self.module.load_state_dict(state_dict, strict=strict)
  File "/workspace/pangu-alpha-applications/megatron/model/gpt2_model.py", line 135, in load_state_dict
    self.language_model.load_state_dict(state_dict, strict=strict)
  File "/workspace/pangu-alpha-applications/megatron/model/language_model.py", line 502, in load_state_dict
    self.embedding.load_state_dict(state_dict_, strict=strict)
  File "/workspace/pangu-alpha-applications/megatron/model/language_model.py", line 218, in load_state_dict
    self.word_embeddings.load_state_dict(state_dict_, strict=strict)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1070, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for VocabParallelEmbedding:
    size mismatch for weight: copying a param with shape torch.Size([40000, 2560]) from checkpoint, the shape in current model is torch.Size([20000, 2560]).
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 303, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launch.py", line 294, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/opt/conda/bin/python', '-u', '/workspace/pangu-alpha-applications/app/chat/train/training_main.py', '--local_rank=1', '--model-parallel-size', '2', '--num-layers', '31', '--hidden-size', '2560', '--num-attention-heads', '32', '--batch-size', '64', '--seq-length', '1024', '--max-position-embeddings', '1024', '--train-iters', '50000', '--lr-decay-iters', '1000', '--save', '/workspace/finetune/model/pangu_dialog_fp16_2b6_new/', '--load', '/workspace/pangu_dialog_fp16_2b6/', '--data-path', '/workspace/finetune/data/', '--vocab-file', '/workspace/pangu-alpha-applications/megatron/bpe_4w_pcl/vocab', '--merge-file', 'gpt2-merges.txt', '--data-impl', 'mmap', '--split', '949,50,1', '--distributed-backend', 'nccl', '--lr', '0.00005', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-6', '--weight-decay', '1e-2', '--clip-grad', '1.0', '--warmup', '.01', '--log-interval', '100', '--save-interval', '10000', '--eval-interval', '500', '--eval-iters', '200', '--attention-dropout', '0.1', '--hidden-dropout', '0.1', '--seed', '1234', '--finetune', '--DDP-impl', 'torch', '--checkpoint-activations', '--distribute-checkpointed-activations', '--fp16-lm-cross-entropy', '--use-cpu-initialization', '--make-vocab-size-divisible-by', '1', '--fp16']' returned non-zero exit status 1.

It looks like a problem with the vocab file:
size mismatch for weight: copying a param with shape torch.Size([40000, 2560]) from checkpoint, the shape in current model is torch.Size([20000, 2560]).
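For context, the mismatch is exactly the model-parallel split of the vocabulary: VocabParallelEmbedding stores only its own rank's slice of the padded 40000-token vocab, so with model-parallel size 2 each rank expects 40000 / 2 = 20000 rows, while the loaded checkpoint shard still holds the full table. A quick sketch of that arithmetic, using the values reported in the log above:

```python
# Worked arithmetic for the reported shape mismatch (values from the log).
vocab_size = 40000            # padded vocab reported by the tokenizer builder
hidden_size = 2560
model_parallel_size = 2
rows_per_rank = vocab_size // model_parallel_size

print("checkpoint weight shape:", (vocab_size, hidden_size))        # (40000, 2560)
print("shape the current model expects:", (rows_per_rank, hidden_size))  # (20000, 2560)
```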
qinghao commented 1 year ago
Poster
Many thanks to the expert in the PanGu-Alpha WeChat group for the answer. Experiment environment: 2× RTX 3090, 64 GB RAM. With GPUS_PER_NODE=1 and NNODES=1 (single GPU) this step runs through fine. For the multi-GPU case, the expert's answer was: "If you are using model parallelism, split the model first, then load it."
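Below is a minimal sketch of the kind of split the answer refers to, not the repository's official tool. It assumes a Megatron-style checkpoint layout (iter_XXXXXXX/mp_rank_0N/model_optim_rng.pt) and the usual nested key path for the word embeddings; the output directory and the key names are assumptions to verify against the actual checkpoint. A complete split would also have to partition the attention and MLP weights the same way Megatron's model parallelism does, so prefer any conversion script shipped with pangu-alpha-applications if one exists.

```python
# Illustrative sketch only: split the word-embedding table of a Megatron-style
# checkpoint shard into per-rank slices for model-parallel size 2.
# Assumptions to verify against your checkpoint:
#   * the nested key path state_dict['model']['language_model']['embedding']
#     ['word_embeddings']['weight'];
#   * the target layout iter_XXXXXXX/mp_rank_0N/model_optim_rng.pt;
#   * DST_DIR below is a hypothetical output location.
# A full split must also partition attention/MLP weights per Megatron's scheme.
import os
import torch

SRC = "/workspace/pangu_dialog_fp16_2b6/iter_0049705/mp_rank_00/model_optim_rng.pt"
DST_DIR = "/workspace/pangu_dialog_fp16_2b6_mp2/iter_0049705"  # hypothetical
MP_SIZE = 2

ckpt = torch.load(SRC, map_location="cpu")
emb = ckpt["model"]["language_model"]["embedding"]["word_embeddings"]["weight"]
assert emb.shape[0] % MP_SIZE == 0, "vocab size must divide evenly across ranks"
rows = emb.shape[0] // MP_SIZE  # 40000 // 2 = 20000

for rank in range(MP_SIZE):
    # Only the embedding slice differs per rank in this sketch; everything else
    # in the checkpoint is written out unchanged.
    ckpt["model"]["language_model"]["embedding"]["word_embeddings"]["weight"] = \
        emb[rank * rows:(rank + 1) * rows].clone()
    out_dir = os.path.join(DST_DIR, f"mp_rank_{rank:02d}")
    os.makedirs(out_dir, exist_ok=True)
    torch.save(ckpt, os.path.join(out_dir, "model_optim_rng.pt"))
```

After splitting, point --load at the new directory (and, if the loader expects one, make sure its latest-iteration marker file matches the iter_XXXXXXX folder you wrote).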
qinghao closed this issue 1 year ago