

Bert-Chinese-Text-Classification-Pytorch


Chinese text classification with BERT and ERNIE, implemented in PyTorch and ready to use out of the box.

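Introduction

A write-up of the models and the data flow is not finished yet; the blog link will be posted here once it is.
Hardware: one 2080Ti. Training time: 30 minutes.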

Environment

python 3.7
pytorch 1.1
tqdm
sklearn
tensorboardX
~~pytorch_pretrained_bert~~ (the pretraining code is now included in this repository, so this library is no longer needed)

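Chinese dataset

I extracted 200,000 news headlines from THUCNews and uploaded them to GitHub. Each headline is 20 to 30 characters long. There are 10 categories with 20,000 headlines each. The text is fed to the model at the character level.

Categories: finance, real estate, stocks, education, technology, society, politics, sports, games, entertainment.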
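To make "character level" concrete, here is a minimal sketch of turning a headline into fixed-length ids plus an attention mask. It is a simplification, not the repository's tokenizer: the function name `encode_chars`, the toy vocabulary, and the default `pad_size` of 32 are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

PAD, UNK, CLS = "[PAD]", "[UNK]", "[CLS]"


def encode_chars(text: str, vocab: Dict[str, int],
                 pad_size: int = 32) -> Tuple[List[int], List[int], int]:
    """Encode text one character per token (hypothetical helper).

    Returns (token ids, attention mask, unpadded length).
    """
    tokens = [CLS] + list(text)              # one token per character
    ids = [vocab.get(t, vocab[UNK]) for t in tokens]
    ids = ids[:pad_size]                     # truncate long inputs
    seq_len = len(ids)
    padding = pad_size - seq_len
    mask = [1] * seq_len + [0] * padding     # 1 = real token, 0 = padding
    ids = ids + [vocab[PAD]] * padding
    return ids, mask, seq_len


if __name__ == "__main__":
    # Toy vocabulary; a real run would load vocab.txt instead.
    toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "股": 3, "票": 4}
    print(encode_chars("股票涨", toy_vocab, pad_size=6))
```

A real BERT tokenizer also handles Latin text, digits and punctuation differently, so treat this only as an illustration of the input shape.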

Dataset split:

Split	Size
Training set	180,000
Validation set	10,000
Test set	10,000
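The README does not say how the split was produced. As a rough sketch, a seeded random shuffle followed by slicing yields splits of the sizes in the table; `split_dataset` and its parameters are hypothetical names, not code from this repository.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], n_dev: int, n_test: int,
                  seed: int = 1) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once, then cut into train / validation / test lists."""
    if n_dev + n_test > len(samples):
        raise ValueError("validation + test size exceeds dataset size")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


if __name__ == "__main__":
    # 200,000 placeholder samples split 180,000 / 10,000 / 10,000
    train, dev, test = split_dataset(range(200_000), 10_000, 10_000)
    print(len(train), len(dev), len(test))
```

A plain shuffle does not guarantee that every class keeps exactly the same share in each split; a stratified split would be needed for that.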

Using your own dataset

Format your Chinese dataset the same way as the dataset provided here.
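The README does not spell out the file format, so check the files under THUCNews before converting your own data. The sketch below assumes one sample per line, with the text and an integer class index separated by a tab; that layout and the helper name `read_samples` are assumptions.

```python
from typing import List, Tuple


def read_samples(path: str) -> List[Tuple[str, int]]:
    """Read 'text<TAB>label' lines (assumed layout) into (text, label) pairs."""
    samples = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue                      # skip blank lines
            text, sep, label = line.rpartition("\t")
            if not sep:
                raise ValueError(f"line {line_no}: expected text<TAB>label")
            samples.append((text, int(label)))
    return samples
```

Running a check like this over your own files before training catches missing tabs or non-numeric labels early.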

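Results

Model	acc	Notes
bert	94.83%	plain bert
ERNIE	94.61%	so much for outperforming bert on Chinese
bert_CNN	94.44%	bert + CNN
bert_RNN	94.57%	bert + RNN
bert_RCNN	94.51%	bert + RCNN
bert_DPCNN	94.47%	bert + DPCNN

Plain bert already performs very well. Using bert as an embedding layer and feeding its output into other models actually lowers accuracy. A comparison on long texts is planned.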
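As an illustration of what "bert as an embedding layer" means, the sketch below applies a small convolutional head to per-token encoder outputs. It is not the code in models/bert_CNN.py: the class name `CNNHead`, the filter sizes and the other hyperparameters are assumptions, and a random tensor stands in for the BERT output.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNNHead(nn.Module):
    """Text-CNN classifier over encoder outputs (hypothetical sketch)."""

    def __init__(self, hidden_size: int = 768, num_filters: int = 256,
                 filter_sizes=(2, 3, 4), num_classes: int = 10,
                 dropout: float = 0.1):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(hidden_size, num_filters, k) for k in filter_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, encoder_out: torch.Tensor) -> torch.Tensor:
        # encoder_out: (batch, seq_len, hidden) -> (batch, hidden, seq_len)
        x = encoder_out.transpose(1, 2)
        # Convolve with each filter width, then max-pool over time.
        pooled = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        features = torch.cat(pooled, dim=1)
        return self.fc(self.dropout(features))   # (batch, num_classes)


if __name__ == "__main__":
    fake_bert_output = torch.randn(4, 32, 768)   # stand-in for BERT outputs
    print(CNNHead()(fake_bert_output).shape)
```

Plain bert classification typically feeds only the pooled [CLS] representation into a linear layer, which is the baseline the extra heads are compared against in the table.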

For results of CNN, RNN, DPCNN, RCNN, RNN+Attention, FastText and other models, see my other repository.

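Pretrained language models

The bert model goes in the bert_pretrain directory and the ERNIE model in the ERNIE_pretrain directory. Each directory contains three files:

pytorch_model.bin
bert_config.json
vocab.txt

Download links for the pretrained models:
bert_Chinese: model https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz
vocabulary https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt
Source: here
Backup (Baidu Netdisk) for the model: https://pan.baidu.com/s/1qSAD5gwClq7xlgzl_4W3Pw

ERNIE_Chinese: http://image.nghuyong.top/ERNIE.zip
Source: here
Backup (Baidu Netdisk): https://pan.baidu.com/s/1lEPdDN1-YQJmKEd_g9rLgw

After extracting, place the files in the corresponding directory as described above and make sure the file names match exactly.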
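A quick way to confirm the file names before training is a small check script. The sketch below uses only the directory and file names listed above; the helper `missing_files` is a hypothetical name, not part of the repository.

```python
from pathlib import Path
from typing import List

# File names each pretrained-model directory is expected to contain.
REQUIRED_FILES = ("pytorch_model.bin", "bert_config.json", "vocab.txt")


def missing_files(model_dir: str) -> List[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    for directory in ("bert_pretrain", "ERNIE_pretrain"):
        missing = missing_files(directory)
        status = "OK" if not missing else "missing: " + ", ".join(missing)
        print(f"{directory}: {status}")
```

Run it from the repository root; any name it reports as missing usually means the archive extracted under a different file name and needs renaming.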

Usage

Once the pretrained models are downloaded, you can run the code.

# Train and test:
# bert
python run.py --model bert

# bert + other models
python run.py --model bert_CNN

# ERNIE
python run.py --model ERNIE

Parameters

All models live in the models directory. Each file contains both the hyperparameter definitions and the model definition.
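One way a `--model` value can select a file in the models directory is a dynamic import keyed on the model name. The README does not show how run.py does this, so the sketch below is an assumption; `parse_args`, `module_path` and `load_model_module` are illustrative names.

```python
import argparse
from importlib import import_module


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the command line; --model names a file in the models directory."""
    parser = argparse.ArgumentParser(description="Chinese text classification")
    parser.add_argument("--model", type=str, required=True,
                        help="model name, e.g. bert, bert_CNN, ERNIE")
    return parser.parse_args(argv)


def module_path(model_name: str) -> str:
    """Map a model name to its dotted module path, e.g. 'models.bert_CNN'."""
    return "models." + model_name


def load_model_module(model_name: str):
    """Import the module that holds both hyperparameters and model definition."""
    return import_module(module_path(model_name))


if __name__ == "__main__":
    args = parse_args()
    print("would import", module_path(args.model))
```

Keeping hyperparameters next to the model definition means adding a new model only requires adding one file whose name matches the `--model` value.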

To be continued

bert + CNN, RNN, RCNN, DPCNN, etc.
ERNIE + CNN, RNN, RCNN, DPCNN, etc.
XLNET
Try label smoothing to see its effect
Wrap prediction in a reusable function
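For the label smoothing item, the usual formulation replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution over the K classes, then takes cross-entropy against that softened target. The sketch below is a plain-Python illustration for a single example; the function names and the default `epsilon` of 0.1 are assumptions, and nothing here is taken from the repository.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def label_smoothing_loss(logits: Sequence[float], target: int,
                         epsilon: float = 0.1) -> float:
    """Cross-entropy against a smoothed target distribution.

    q_i = (1 - epsilon) * [i == target] + epsilon / K
    """
    k = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        q = epsilon / k + ((1.0 - epsilon) if i == target else 0.0)
        loss -= q * log_p
    return loss


if __name__ == "__main__":
    logits = [10.0, 0.0, 0.0]
    print(label_smoothing_loss(logits, target=0, epsilon=0.0))
    print(label_smoothing_loss(logits, target=0, epsilon=0.1))
```

With `epsilon` set to 0 this reduces to ordinary cross-entropy; a positive value penalizes over-confident predictions.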

Papers

[1] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[2] ERNIE: Enhanced Representation through Knowledge Integration


lin15892002160@163.com
