Few-NERD is a large-scale, fine-grained, manually annotated named entity recognition dataset. It contains 8 coarse-grained types, 66 fine-grained types, 188,200 sentences, 491,711 entities, and 4,601,223 tokens. Three benchmark tasks are built on it: one supervised (Few-NERD (SUP)) and two few-shot (Few-NERD (INTRA) and Few-NERD (INTER)). Few-NERD was collected by researchers from Tsinghua University and DAMO Academy, Alibaba Group.
Website: https://ningding97.github.io/fewnerd/
Paper: https://arxiv.org/abs/2105.07464
GitHub: https://github.com/thunlp/Few-NERD
The three tasks differ in how entity types are split: Few-NERD (SUP) is standard supervised NER, Few-NERD (INTRA) is few-shot NER across coarse-grained types, and Few-NERD (INTER) is few-shot NER that does not cross coarse-grained types.
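Recently, a large body of work has appeared on few-shot named entity recognition (few-shot NER). The task is both practically valuable and challenging, yet there is little benchmark data built specifically for it: most previous studies reorganize existing supervised NER datasets into a few-shot setting. These strategies usually aim to recognize coarse-grained entity types from a few examples, whereas in practice most entity types are fine-grained. In this paper, we release Few-NERD, a large-scale, manually annotated dataset for few-shot NER. It contains 8 coarse-grained and 66 fine-grained entity types, and every entity label has a two-level hierarchy of coarse-grained plus fine-grained type. The dataset consists of 188,238 sentences from Wikipedia with 4,601,160 words, each annotated either as context or as part of an entity type. It is the first few-shot NER dataset and the largest manually annotated NER dataset. We construct benchmark tasks with different emphases to comprehensively evaluate the generalization ability of models. Extensive empirical results and analysis show that few-shot NER is challenging and calls for further research.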
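Sampling in the few-shot setting
Named entity recognition marks the entities in a text. In the few-shot setting, samples are organized into episodes in N-way K-shot form. Each episode consists of two sets: a support set used for learning and a query set used for prediction. That is, the support set of each episode contains N entity types with K entities per type, and the query set contains entities of the same types as the support set. The model learns from the support set in order to predict the labels of the query set. Because NER depends strongly on context, sampling is usually done at the sentence level. And because a single sentence may contain several entities of several types, sentence-level sampling can rarely satisfy the N-way K-shot setting exactly. We therefore designed a more relaxed sampling method based on a greedy strategy, which keeps the number of entities of each type between K and 2K. Each step randomly draws a sentence and computes the number of entity types, and the number of instances of each type, that the set would contain with that sentence added. If the number of types exceeds N or any type exceeds 2K instances, the sentence is discarded; otherwise it is added to the set. Sampling stops once the set covers N entity types with at least K entities each.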
Note: if you plan to do few-shot research, we recommend following the sampler implementation in the paper and the code.
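The greedy procedure described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official sampler: the function name `greedy_episode_sample`, the input layout (one list of entity-type labels per sentence), the choice to skip entity-free sentences, and the use of a single shuffled pass as "random draws without replacement" are all hypothetical choices made for this example.

```python
import random
from collections import Counter


def greedy_episode_sample(sentence_types, n_way, k_shot, seed=None):
    """Greedily pick sentences until N types each have between K and 2K entities.

    sentence_types: list where item i holds the entity-type label of every
    entity in sentence i, e.g. ["person-actor", "location-GPE"] (assumed layout).
    Returns (indices of chosen sentences, {entity type: entity count}).
    """
    rng = random.Random(seed)
    order = list(range(len(sentence_types)))
    rng.shuffle(order)  # a shuffled pass stands in for random draws without replacement

    chosen = []
    counts = Counter()  # entity type -> number of entities in the set so far
    for idx in order:
        entities = Counter(sentence_types[idx])
        if not entities:
            continue  # assumption: sentences without entities are skipped
        trial = counts + entities  # counts if this sentence were added
        if len(trial) > n_way or any(c > 2 * k_shot for c in trial.values()):
            continue  # would exceed N types or 2K entities of a type: discard
        chosen.append(idx)
        counts = trial
        if len(counts) == n_way and all(c >= k_shot for c in counts.values()):
            return chosen, dict(counts)
    raise ValueError("could not build an N-way K~2K-shot set from these sentences")


if __name__ == "__main__":
    # Toy pool with made-up type names, one entity per sentence.
    pool = [["A"]] * 5 + [["B"]] * 5 + [["C"]] * 5
    print(greedy_episode_sample(pool, n_way=2, k_shot=2, seed=0))
```

The same function could be called twice per episode, once for the support set and once for the query set restricted to the types chosen for the support set; that wiring is left out here to keep the sketch short.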
Data format
Between O
1789 O
and O
1793 O
he O
sat O
on O
a O
committee O
reviewing O
the O
administrative MISC-law
constitution MISC-law
of MISC-law
Galicia MISC-law
to O
little O
effect O
. O
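A minimal reader for this layout might look like the sketch below. It assumes, beyond what the example shows, that token and label are separated by whitespace, that sentences are separated by blank lines, and that adjacent tokens carrying the same non-O label form a single entity. The function names `read_sentences` and `extract_entities` are hypothetical.

```python
def read_sentences(lines):
    """Split an iterable of 'token label' lines into (tokens, labels) pairs."""
    sentences, tokens, labels = [], [], []
    for line in lines:
        line = line.strip()
        if not line:  # assumption: a blank line ends the current sentence
            if tokens:
                sentences.append((tokens, labels))
                tokens, labels = [], []
            continue
        token, label = line.rsplit(maxsplit=1)  # label is the last field
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last sentence if the input has no trailing blank line
        sentences.append((tokens, labels))
    return sentences


def extract_entities(tokens, labels):
    """Group runs of identical non-O labels into (text, coarse, fine) triples."""
    entities = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != "O":
                # Labels look like "MISC-law": coarse type, hyphen, fine type.
                coarse, _, fine = labels[start].partition("-")
                entities.append((" ".join(tokens[start:i]), coarse, fine))
            start = i
    return entities


if __name__ == "__main__":
    sample = [
        "reviewing O", "the O", "administrative MISC-law",
        "constitution MISC-law", "of MISC-law", "Galicia MISC-law",
        "to O", "little O", "effect O", ". O",
    ]
    for toks, labs in read_sentences(sample):
        print(extract_entities(toks, labs))
```

Splitting the label at the first hyphen keeps the coarse and fine levels separately available, which is what distinguishes the INTRA and INTER settings.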
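Few-NERD (SUP): the standard sentence-level supervised NER dataset
Few-NERD (INTRA): the few-shot NER dataset split across coarse-grained types
Few-NERD (INTER): the few-shot NER dataset not split across coarse-grained types
If you use Few-NERD, please cite our paper: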
@article{ding2021few,
title={Few-NERD: A Few-Shot Named Entity Recognition Dataset},
author={Ding, Ning and Xu, Guangwei and Chen, Yulin and Wang, Xiaobin and Han, Xu and Xie, Pengjun and Zheng, Hai-Tao and Liu, Zhiyuan},
journal={arXiv preprint arXiv:2105.07464},
year={2021}
}