Compress PanGu-α 13B/2.6B from 8-card to single-card inference, with only about 2% fluctuation in accuracy. By combining model compression techniques with modifications to MindSpore's underlying source code, PanGu-α 13B/2.6B runs successfully on a single card: memory usage drops from 8 cards to 1, saving 7 cards. When using this code, modify the strategy file and model file paths for the corresponding PanGu-α 13B/2.6B model. The implementation targets the OpenI Cloudbrain platform; for local runs, modify the file paths and the file-copy code.
Quantization
Load the model at lower precision: cast most parameters from float32 to float16 and handle the resulting quantization noise.
Because PanGu-α 13B was trained with mixed precision, a large number of parameters stored as fp32 actually participate in computation as fp16. The goal is therefore to pick out those parameters and store them in fp16, which greatly reduces the model's memory footprint with almost no impact on accuracy.
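The selective fp32→fp16 cast can be sketched as follows. This is an illustrative NumPy sketch, not the repository's actual loading code: the parameter names and the skip list are assumptions, but the idea is the same — downcast eligible float32 arrays while leaving precision-sensitive entries (e.g. layer-norm weights) in fp32.

```python
import numpy as np

def downcast_params(params, skip_keywords=("layernorm", "layer_norm")):
    """Return a copy of `params` with eligible float32 arrays cast to float16.

    Parameters whose names contain a skip keyword keep their original dtype,
    since those weights are typically precision-sensitive.
    """
    out = {}
    for name, arr in params.items():
        if arr.dtype == np.float32 and not any(k in name.lower() for k in skip_keywords):
            out[name] = arr.astype(np.float16)
        else:
            out[name] = arr
    return out

# Hypothetical parameter names for illustration.
params = {
    "transformer.dense.weight": np.random.randn(4, 4).astype(np.float32),
    "transformer.layernorm.gamma": np.ones(4, dtype=np.float32),
}
compressed = downcast_params(params)
print(compressed["transformer.dense.weight"].dtype)     # float16
print(compressed["transformer.layernorm.gamma"].dtype)  # float32
```

Each downcast array halves its memory, which is where most of the single-card savings come from.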
Parameter sharing
The output layer shares its parameters with the embedding layer; the model already adopts this scheme.
With an embedding size of 2560 and a vocabulary size of 40000, this saves 40000*2560 parameters.
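Weight tying can be sketched as below; this is a minimal illustration (the function and variable names are hypothetical), showing that reusing the embedding table as the output projection avoids allocating a second vocab_size × embed_dim matrix.

```python
import numpy as np

vocab_size, embed_dim = 40000, 2560

# Single table used both for input embedding lookup and output projection.
embedding = np.random.randn(vocab_size, embed_dim).astype(np.float16)

def output_logits(hidden, embedding_table):
    # Project hidden states back to the vocabulary with the tied table.
    return hidden @ embedding_table.T

hidden = np.random.randn(1, embed_dim).astype(np.float16)
print(output_logits(hidden, embedding).shape)  # (1, 40000)

# Parameters saved by not allocating a separate output matrix:
print(vocab_size * embed_dim)  # 102400000
```

At fp16 that saved matrix alone is roughly 200 MB, a meaningful slice of a single card's memory budget.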
MindSpore source modifications
Parallel-mode mismatch: training uses semi-automatic model parallelism, while single-card loading uses no model parallelism.
Dtype mismatch: the parameter dtypes saved during training differ from the model's parameter dtypes at inference, which requires modifying MindSpore's underlying support.
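The dtype-mismatch workaround can be sketched as follows. This is a hedged illustration only (the real fix patches MindSpore internals): before loading, cast each checkpoint array to the dtype the inference graph expects.

```python
import numpy as np

def align_dtypes(ckpt, expected_dtypes):
    """Cast checkpoint arrays to the dtypes the inference model expects.

    `ckpt` maps parameter names to arrays; `expected_dtypes` maps the same
    names to target dtypes. Names absent from `expected_dtypes` are left as-is.
    """
    aligned = {}
    for name, arr in ckpt.items():
        want = expected_dtypes.get(name, arr.dtype)
        aligned[name] = arr.astype(want) if arr.dtype != want else arr
    return aligned

# Hypothetical names: a checkpoint saved in fp32, a model expecting fp16.
ckpt = {"dense.weight": np.zeros((2, 2), dtype=np.float32)}
expected = {"dense.weight": np.float16}
print(align_dtypes(ckpt, expected)["dense.weight"].dtype)  # float16
```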
The single-card inference code differs from the 8-card model-parallel inference code in three main aspects. The relevant files are:

- pangu_dropout_recompute_eos_fp16.py
- mindspore_ascend-1.1.0-cp37-cp37m-linux_aarch64.whl
```python
# The module file name contains hyphens, so a plain
# `from eval_task-13b-fp16 import get_model` is not valid Python syntax;
# importlib can load it instead.
import importlib

get_model = importlib.import_module("eval_task-13b-fp16").get_model
model = get_model(args)
```
| Model | Memory footprint | Single-inference latency |
|---|---|---|
| PanGu-α 13B, 8 cards (before compression) | 8 cards | ~150 ms |
| PanGu-α 13B, 1 card (after compression) | 1 card | ~250 ms |
Zero-shot evaluation:

| Model | WebQA.v1.0 (em/f1) | CLUEWSC2020 (acc) |
|---|---|---|
| PanGu-α 13B, 8 cards (before compression) | 5.126/14.470 | 75.000 |
| PanGu-α 13B, 1 card (after compression) | 5.060/14.466 | 73.684 |