# Training the Pangu 2.6B Chinese dialogue model with RLHF using trlx

Our pipeline is adapted from the open-source code accompanying the OpenAI paper "Learning to Summarize from Human Feedback".
## Preparation

1). Set up the environment for the trlx library; see [trlx](https://github.com/CarperAI/trlx):

```shell
git clone https://github.com/CarperAI/trlx.git
cd trlx
pip install torch --extra-index-url https://download.pytorch.org/whl/cu116  # for CUDA
pip install -e .
```
2). Download the Pangu-2.6B model:

https://huggingface.co/imone/pangu_2_6B
3). Prepare the SFT dataset (using webtext as an example):

https://paperswithcode.com/dataset/webtext

Save it to: ./dialogue_dir/demo.json
4). Collect human feedback data.

Save it to: ./reward_data_dir/processed/demo.json
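The schema of these demo.json files is not documented here. Below is a minimal sketch of plausible record formats; the field names `prompt`, `response`, `chosen`, and `rejected` are assumptions following common SFT/RLHF conventions, not the actual schema expected by dataprocess.py:

```python
import json
import os
import tempfile

# Assumed SFT dialogue record: one prompt with a reference response.
sft_example = {"prompt": "问:今天天气怎么样?", "response": "答:今天晴朗。"}

# Assumed human-feedback record: one preferred and one rejected
# response for the same prompt, as used for reward-model training.
reward_example = {
    "prompt": "问:今天天气怎么样?",
    "chosen": "答:今天晴朗,适合外出。",
    "rejected": "答:不知道。",
}

def write_demo(path, records):
    """Write records as a UTF-8 JSON array at the given path."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f, ensure_ascii=False, indent=2)

root = tempfile.mkdtemp()  # replace with "." to write into the repo layout
write_demo(os.path.join(root, "dialogue_dir/demo.json"), [sft_example])
write_demo(os.path.join(root, "reward_data_dir/processed/demo.json"),
           [reward_example])
```

Check the actual keys consumed by dataprocess.py and the training scripts before adopting this layout.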
## Training

1). Supervised fine-tuning (SFT):

```shell
cd sft/ && deepspeed train_SFT.py
```
2). Train the reward model:

```shell
cd reward_model/ && deepspeed train_reward_model.py
```
3). Reinforcement learning with PPO:

```shell
accelerate launch --config_file configs/default_accelerate_config.yaml trlx_pangu_rlhf.py
```
Note: at least one V100 GPU is required.
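For intuition, the two objectives driving steps 2 and 3 can be sketched in plain Python (illustrative only; the training scripts and trlx implement these over PyTorch tensors): the reward model minimizes a pairwise ranking loss, as in "Learning to Summarize from Human Feedback", and PPO maximizes a clipped surrogate objective:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pairwise_reward_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected): lower when the reward model
    scores the human-preferred response higher than the rejected one."""
    return -math.log(sigmoid(r_chosen - r_rejected))

def ppo_clipped_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for a single action/token.
    ratio = pi_new / pi_old; clipping keeps each policy update small."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped * advantage)

# The reward loss shrinks as the chosen-vs-rejected margin grows:
assert pairwise_reward_loss(2.0, -1.0) < pairwise_reward_loss(0.5, 0.0)

# With a positive advantage, the objective is capped once the
# probability ratio exceeds 1 + clip_eps:
print(ppo_clipped_objective(logp_new=0.5, logp_old=0.0, advantage=1.0))  # prints 1.2
```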