In the inference phase, the PanGu-α-13B and 2.6B models are compressed from 8 NPUs to 1 NPU with only about 2% performance fluctuation. To achieve this, several model compression techniques are applied and the MindSpore source code is adapted. When using PanGu-α-13B/2.6B, modify the policy file and model file paths accordingly. The implementation is based on the environment of PengCheng Cloud Brain II; for local deployment, modify the file paths and the code related to file replication.
## Quantization
By loading the model in low precision, most of the float32 parameters are converted to float16, and the resulting quantization noise is handled accordingly.
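The float32-to-float16 conversion can be sketched as below. This is an illustration in NumPy, not the actual MindSpore loading code; the layer shape and function names are hypothetical.

```python
import numpy as np

def quantize_to_fp16(weights_fp32):
    """Cast float32 weights to float16 and report the quantization noise."""
    weights_fp16 = weights_fp32.astype(np.float16)
    # Quantization noise: gap between the original values and the
    # float16 values cast back to float32.
    noise = weights_fp32 - weights_fp16.astype(np.float32)
    return weights_fp16, np.abs(noise).max()

rng = np.random.default_rng(0)
w = rng.standard_normal((2560, 2560)).astype(np.float32)  # hypothetical layer
w16, max_noise = quantize_to_fp16(w)
print(w16.dtype, w16.nbytes / w.nbytes)  # float16 weights at half the memory
```

For weights of moderate magnitude, the float16 rounding error stays small relative to the values, which is why accuracy degrades only slightly.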
## Parameter sharing
The model is adapted so that the output layer shares its parameters with the embedding layer.
With an embedding size of 2560 and a vocabulary size of 40,000, this saves 40,000 × 2,560 ≈ 102.4 million parameters.
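Weight sharing between the embedding and output layers (often called weight tying) can be sketched as follows. The NumPy code is illustrative; the real model uses MindSpore layers.

```python
import numpy as np

VOCAB_SIZE = 40_000
EMBED_SIZE = 2_560

rng = np.random.default_rng(0)
embedding = rng.standard_normal((VOCAB_SIZE, EMBED_SIZE)).astype(np.float32)

def embed(token_ids):
    """Look up token embeddings."""
    return embedding[token_ids]

def output_logits(hidden):
    """Project hidden states back onto the vocabulary by reusing the
    embedding matrix (transposed) instead of a separate output weight."""
    return hidden @ embedding.T

# Parameters saved by not storing a second vocabulary-sized matrix.
saved = VOCAB_SIZE * EMBED_SIZE
print(f"{saved:,} parameters saved")  # 102,400,000 parameters saved
```

Because the output projection reuses the same matrix, the single-NPU checkpoint omits one vocabulary-sized weight entirely.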
## MindSpore source code adaptation
The model parallelism strategies used during training and loading are inconsistent: semi-automatic model parallelism is used during training, while no model parallelism is used during loading.
In addition, the parameter types saved after training differ from the parameter types used during inference, so the underlying MindSpore support needs to be modified.
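Loading an 8-NPU model-parallel checkpoint onto a single NPU amounts to merging per-device parameter shards back into full tensors. A minimal sketch of the idea, with hypothetical shard layout and names (real checkpoints follow the MindSpore parallel strategy file):

```python
import numpy as np

def merge_shards(shards, axis):
    """Concatenate per-device parameter shards along their partition axis
    to recover the full single-device tensor."""
    return np.concatenate(shards, axis=axis)

# Example: an embedding table split row-wise across 8 devices.
full_rows, cols = 40_000, 2_560
shards = [np.zeros((full_rows // 8, cols), dtype=np.float32) for _ in range(8)]
merged = merge_shards(shards, axis=0)
print(merged.shape)  # full table restored on one device
```

Each parameter's partition axis must match the parallel strategy it was trained with, which is why the policy file path has to be set correctly.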
There are three main differences in the inference code:

- `pangu_dropout_recompute_eos_fp16.py`
- `mindspore_ascend-1.1.0-cp37-cp37m-linux_aarch64.whl`
```python
# Module name written with underscores here, since hyphens are not valid
# in Python import statements.
from eval_task_13b_fp16 import get_model

model = get_model(args)
```
| Model | Memory occupation | Inference speed |
|---|---|---|
| PanGu-α-13B (before compression) | 8 NPUs | ~150 ms |
| PanGu-α-13B (after compression) | 1 NPU | ~250 ms |
| Model (zero-shot) | WebQA.v1.0 (EM/F1) | CLUEWSC2020 (acc) |
|---|---|---|
| PanGu-α-13B (before compression) | 5.126/14.470 | 75.000 |
| PanGu-α-13B (after compression) | 5.060/14.466 | 73.684 |
Compress PanGu-α-13B/2.6B from 8 NPUs to 1 NPU for inference.