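DuEL 2.0 is a dataset for Chinese short-text entity linking. Its samples come mainly from search queries, Weibo posts, dialogue content, and titles; they are highly colloquial and carry little surrounding context, which makes the task difficult. In addition, DuEL 2.0 has the following characteristics:
1. Large scale: 70,000 training samples, 10,000 development samples, and 10,000 test samples, with 324,000 knowledge base entities and 2.826 million SPO triples;
2. High quality: all annotations were produced through manual crowdsourcing; entity linking and entity type accuracy reach 95%, and the duplication rate of knowledge base entities is below 5%;
3. Real-world scenarios: the data come from web page titles, UGC short-video titles, and search queries.
DuEL 2.0 consists of two parts, a knowledge base and an annotated dataset, described below:
The knowledge base for this task comes from the Baidu Baike knowledge base. Each entity in the knowledge base contains a subject_id (the knowledge base ID), a subject name, the entity's aliases, its concept types, and a series of <predicate, object> pairs (<attribute, attribute value>) related to the entity. Each line of the knowledge base is one record (one entity), stored in JSON format.
An example is shown below: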
{
"subject_id": "1000131",
"subject": "小王子",
"alias": [
"Le Petit Prince",
"The Little Prince",
"リトルプリンス 星の王子さまと私",
"小王子"
],
"type": [
"Work"
],
"data": [
{
"predicate": "外文名",
"object": "Le Petit Prince"
},
{
"predicate": "发行公司",
"object": "派拉蒙影业"
},
{
"predicate": "类型",
"object": "奇幻"
}
]
}
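The record format above can be read with a few lines of Python. The following is a minimal sketch, not an official loader: it assumes the knowledge base is stored as one JSON object per line, as described above, and the helper names `parse_kb_lines` and `build_alias_index` are hypothetical. The sample record is abridged from the example above.

```python
import json
from collections import defaultdict


def parse_kb_lines(lines):
    """Parse an iterable of JSON lines into a dict keyed by subject_id.

    Hypothetical helper: assumes one JSON record per line.
    """
    kb = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        kb[record["subject_id"]] = record
    return kb


def build_alias_index(kb):
    """Map every subject name and alias to the set of subject_ids using it."""
    index = defaultdict(set)
    for subject_id, record in kb.items():
        for name in [record["subject"]] + record.get("alias", []):
            index[name].add(subject_id)
    return index


# One knowledge base line, abridged from the example record above.
sample_line = json.dumps(
    {
        "subject_id": "1000131",
        "subject": "小王子",
        "alias": ["Le Petit Prince", "The Little Prince", "小王子"],
        "type": ["Work"],
        "data": [{"predicate": "外文名", "object": "Le Petit Prince"}],
    },
    ensure_ascii=False,
)

kb = parse_kb_lines([sample_line])
alias_index = build_alias_index(kb)
print(sorted(alias_index["The Little Prince"]))  # ['1000131']
```

For a real file, the same function could be called on an open file handle, for example `parse_kb_lines(open(path, encoding="utf-8"))`, where `path` is whatever location the knowledge base file was saved to.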
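The annotated dataset consists of a training set, a development (validation) set, and a test set, about 90,000 annotated samples in total. The annotated data come mainly from web page titles, UGC short-video titles, and search queries. All annotations were produced through Baidu crowdsourcing. Each sample in the annotated dataset has the following format: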
{
"text_id": "1",
"text": "《琅琊榜》海宴_【原创小说|权谋小说】",
"mention_data": [
{
"kb_id": "2135131",
"mention": "琅琊榜",
"offset": "1"
},
{
"kb_id": "10572965",
"mention": "海宴",
"offset": "5"
},
{
"kb_id": "215143",
"mention": "原创小说",
"offset": "9"
},
{
"kb_id": "NIL_Work",
"mention": "权谋小说",
"offset": "14"
}
]
}
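A short sketch of how one annotated sample might be checked and split. Two things here are assumptions rather than statements from the dataset description: that `offset` is a zero-based character index into `text` (this matches the example above), and that a `kb_id` starting with `NIL` marks a mention with no knowledge base entry, followed by its type. The function names are hypothetical.

```python
def check_offsets(sample):
    """Return the mentions whose offset does not match the text.

    Assumes offset is a zero-based character index stored as a string.
    """
    text = sample["text"]
    mismatched = []
    for m in sample["mention_data"]:
        start = int(m["offset"])
        end = start + len(m["mention"])
        if text[start:end] != m["mention"]:
            mismatched.append(m)
    return mismatched


def split_mentions(sample):
    """Split mentions into (linked, nil) lists of (mention, kb_id) pairs.

    Assumes a kb_id beginning with "NIL" means the mention is not in the
    knowledge base. Ids are stripped in case of stray whitespace.
    """
    linked, nil = [], []
    for m in sample["mention_data"]:
        kb_id = m["kb_id"].strip()
        target = nil if kb_id.startswith("NIL") else linked
        target.append((m["mention"], kb_id))
    return linked, nil


# Sample taken from the example above; the last kb_id is padded with
# spaces on purpose to exercise strip().
sample = {
    "text_id": "1",
    "text": "《琅琊榜》海宴_【原创小说|权谋小说】",
    "mention_data": [
        {"kb_id": "2135131", "mention": "琅琊榜", "offset": "1"},
        {"kb_id": "10572965", "mention": "海宴", "offset": "5"},
        {"kb_id": "215143", "mention": "原创小说", "offset": "9"},
        {"kb_id": " NIL_Work ", "mention": "权谋小说", "offset": "14"},
    ],
}

bad = check_offsets(sample)
linked, nil = split_mentions(sample)
print(len(bad), len(linked), nil)
```

Running the offset check over a whole split is a cheap way to catch encoding or tokenization problems before training, since a mismatch usually means the text was altered after annotation.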
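The overall statistics of the dataset are shown in the table below:
Dataset | Training set size | Development set size | Test set size | Number of entities | Number of SPO triples | Average attributes per entity | Average entity description length |
---|---|---|---|---|---|---|---|
DuEL 2.0 | 69942 | 9990 | 9993 | 324,000 | 2.826 million | 8.71 | 39.88 |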