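Chinese text classification with BERT and ERNIE, implemented in PyTorch and ready to use out of the box.
Model introduction and data flow: not finished yet; the blog link will be posted here once it is written.
Hardware: one 2080Ti GPU. Training time: 30 minutes.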
python 3.7
pytorch 1.1
tqdm
sklearn
tensorboardX
pytorch_pretrained_bert (the pretraining code is now included in this repository, so this library is no longer required)
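I extracted 200,000 news headlines from THUCNews and uploaded them to GitHub; each headline is 20 to 30 characters long. There are 10 categories with 20,000 headlines each. The text is fed to the model character by character.
Categories: finance, real estate, stocks, education, science and technology, society, politics, sports, games, entertainment.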
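Character-level input can be illustrated with a minimal sketch. It assumes BERT-style special tokens (`[CLS]`, `[PAD]`, `[UNK]`), a hypothetical `pad_size` parameter, and a toy vocabulary; the function name `encode_title` is made up for illustration and is not taken from this repository.

```python
from typing import Dict, List, Tuple

PAD, UNK, CLS = "[PAD]", "[UNK]", "[CLS]"


def encode_title(title: str, vocab: Dict[str, int], pad_size: int = 32) -> Tuple[List[int], List[int]]:
    """Split a title into characters, prepend [CLS], then truncate or pad to pad_size.

    Returns (token ids, attention mask); the mask is 1 for real tokens and 0 for padding.
    """
    tokens = ([CLS] + list(title))[:pad_size]           # one token per character
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokens]  # unknown characters map to [UNK]
    mask = [1] * len(ids)
    padding = pad_size - len(ids)
    ids += [vocab[PAD]] * padding
    mask += [0] * padding
    return ids, mask


# Toy vocabulary for illustration only; a real vocab.txt has thousands of entries.
toy_vocab = {PAD: 0, UNK: 1, CLS: 2, "股": 3, "市": 4, "上": 5, "涨": 6}
ids, mask = encode_title("股市上涨了", toy_vocab, pad_size=8)
print(ids)   # [2, 3, 4, 5, 6, 1, 0, 0]
print(mask)  # [1, 1, 1, 1, 1, 1, 0, 0]
```

Since the headlines are 20 to 30 characters long, a padded length slightly above 30 would cover every title plus the `[CLS]` token; the default of 32 here is an assumption, not a documented setting.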
Dataset split:
Dataset | Samples |
---|---|
Training set | 180,000 |
Validation set | 10,000 |
Test set | 10,000 |
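For reference, a split with these sizes can be sketched as below. This is a generic shuffle-and-slice under assumptions: the README does not say how the actual split was produced (for example, whether it was stratified by category), and `split_dataset` and its `seed` parameter are hypothetical names.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], dev_size: int, test_size: int,
                  seed: int = 1) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the samples with a fixed seed and cut them into train/dev/test."""
    if dev_size + test_size > len(samples):
        raise ValueError("dev_size + test_size exceeds the number of samples")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state untouched
    test = shuffled[:test_size]
    dev = shuffled[test_size:test_size + dev_size]
    train = shuffled[test_size + dev_size:]
    return train, dev, test


# 200,000 headlines split into 180,000 / 10,000 / 10,000.
train, dev, test = split_dataset(range(200_000), dev_size=10_000, test_size=10_000)
print(len(train), len(dev), len(test))  # 180000 10000 10000
```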
Model | Accuracy | Notes |
---|---|---|
bert | 94.83% | Plain BERT |
ERNIE | 94.61% | Wasn't it supposed to crush BERT on Chinese? |
bert_CNN | 94.44% | BERT + CNN |
bert_RNN | 94.57% | BERT + RNN |
bert_RCNN | 94.51% | BERT + RCNN |
bert_DPCNN | 94.47% | BERT + DPCNN |
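Plain BERT already performs very well. Using BERT as an embedding layer that feeds other models actually lowers accuracy. I will compare results on long texts later.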
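To put the gaps in the table in concrete terms, the sketch below converts each accuracy into a count of correctly classified headlines. It assumes the reported accuracies were measured on the 10,000-headline test set, which the table does not state explicitly.

```python
TEST_SIZE = 10_000  # assumed evaluation set: the test split of 10,000 headlines

# Accuracies copied from the results table, in percent.
accuracy_percent = {
    "bert": 94.83,
    "ERNIE": 94.61,
    "bert_CNN": 94.44,
    "bert_RNN": 94.57,
    "bert_RCNN": 94.51,
    "bert_DPCNN": 94.47,
}


def correct_count(acc_percent: float, n: int = TEST_SIZE) -> int:
    """Number of correctly classified samples implied by an accuracy."""
    return round(acc_percent * n / 100)


baseline = correct_count(accuracy_percent["bert"])
gaps = {
    name: correct_count(acc) - baseline
    for name, acc in accuracy_percent.items()
    if name != "bert"
}
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap} headlines relative to plain bert")
```

Under that assumption, every variant trails plain BERT by between 22 and 39 headlines out of 10,000, so the differences are small in absolute terms.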
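For results with CNN, RNN, DPCNN, RCNN, RNN+Attention, FastText, and other models, see my other repository.
Put the BERT model in the bert_pretrain directory and the ERNIE model in the ERNIE_pretrain directory; each directory contains three files.
Pretrained model downloads: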
bert_Chinese: model https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz
Vocabulary: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt
From here
Backup: Baidu Netdisk link for the model: https://pan.baidu.com/s/1qSAD5gwClq7xlgzl_4W3Pw
ERNIE_Chinese: http://image.nghuyong.top/ERNIE.zip
From here
Backup: Baidu Netdisk link: https://pan.baidu.com/s/1lEPdDN1-YQJmKEd_g9rLgw
After unpacking, place the files in the corresponding directories as described above and make sure the file names are correct.
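A quick sanity check of the layout can be sketched as follows. It only verifies that each pretrained-model directory exists and holds three files, as described above; it does not check file names, and the helper names `list_pretrained_files` and `find_problems` are hypothetical.

```python
from pathlib import Path
from typing import Dict, List

PRETRAIN_DIRS = ("bert_pretrain", "ERNIE_pretrain")  # directory names from this README
EXPECTED_FILE_COUNT = 3                              # "each directory contains three files"


def list_pretrained_files(repo_root: str = ".") -> Dict[str, List[str]]:
    """Return the file names found in each pretrained-model directory."""
    root = Path(repo_root)
    found: Dict[str, List[str]] = {}
    for name in PRETRAIN_DIRS:
        directory = root / name
        if directory.is_dir():
            found[name] = sorted(p.name for p in directory.iterdir() if p.is_file())
        else:
            found[name] = []  # a missing directory counts as empty
    return found


def find_problems(found: Dict[str, List[str]]) -> List[str]:
    """Describe every directory that does not hold the expected number of files."""
    return [
        f"{name}: expected {EXPECTED_FILE_COUNT} files, found {len(files)}"
        for name, files in found.items()
        if len(files) != EXPECTED_FILE_COUNT
    ]


if __name__ == "__main__":
    for message in find_problems(list_pretrained_files(".")):
        print(message)
```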
Once the pretrained models are downloaded, you are ready to run.
# Train and test:
# bert
python run.py --model bert
# bert + other models
python run.py --model bert_CNN
# ERNIE
python run.py --model ERNIE
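To train several models in a row, the commands can be wrapped in a small driver script. This is a sketch: only `bert`, `bert_CNN`, and `ERNIE` appear as `--model` values in this README, and the remaining names are assumed to match the entries in the results table. The helpers `build_command` and `run_all` are hypothetical.

```python
import subprocess
import sys
from typing import Iterable, List

CONFIRMED_MODELS = ["bert", "bert_CNN", "ERNIE"]         # shown in the commands above
ASSUMED_MODELS = ["bert_RNN", "bert_RCNN", "bert_DPCNN"]  # assumption: names from the results table


def build_command(model: str, python: str = sys.executable) -> List[str]:
    """Build the argument list for one training run."""
    return [python, "run.py", "--model", model]


def run_all(models: Iterable[str], dry_run: bool = True) -> List[List[str]]:
    """Print each command and, unless dry_run is set, execute it from the repository root."""
    commands = [build_command(model) for model in models]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failing run
    return commands


if __name__ == "__main__":
    run_all(CONFIRMED_MODELS, dry_run=True)
```

Keeping `dry_run=True` by default only prints the commands, which is a cheap way to check the model names before committing GPU time.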
All models live in the models directory; each file defines both the hyperparameters and the model.
[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[2] ERNIE: Enhanced Representation through Knowledge Integration