GENIA Project - GENIA Corpus

2017-11-27


Introduction:

GENIA corpus

The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.

The corpus contains 1,999 Medline abstracts, selected using a PubMed query for the three MeSH (Medical Subject Headings) terms "human", "blood cells", and "transcription factors". The corpus has been annotated with various levels of linguistic and semantic information.

PubMed is a free search engine for biomedical papers and their abstracts. Its database is drawn from MEDLINE. Its core subject is medicine, but it also covers related fields such as nursing and other health disciplines, and it offers fairly comprehensive coverage of related biomedical areas such as biochemistry and cell biology. The search engine is provided by the U.S. National Library of Medicine as part of the Entrez information retrieval system. PubMed does not include the full text of journal articles, but it may provide links to full-text providers (paid or free).

The primary categories of annotation in the GENIA corpus, and the corresponding subcorpora, are:

Part-of-Speech annotation
Constituency (phrase structure) syntactic annotation
Term annotation
Event annotation
Relation annotation
Coreference annotation

Part-of-Speech annotation: http://www.nactem.ac.uk/genia/genia-corpus/pos-annotation

Overview

Part-of-speech (POS) tagging is an initial step of natural language processing, often performed right after or together with tokenization. After tokenization, every token is assigned a POS label. The GENIA POS annotation generally follows the Penn Treebank POS tagging scheme, with the following modifications:

The NNP and NNPS (proper noun) tags are used only for the names of journals, authors, and research institutes, and for the initials of patients. In particular, names (such as discoverers' names) that appear inside technical terms (e.g. Epstein-Barr virus, Southern blotting) are not tagged NNP.
SYM tags were eliminated as far as possible.

See the annotation guidelines for details. The abstracts were first tagged by the JunK tagger and then corrected by human annotators.

Examples

Corpus format

The corpus is available in two formats, both included in the package available for download below.

PTB-like format: The file contains one token/POS pair per line, with a "==========" line (ten equal signs) between sentences.
"Merged" gpml format: The POS information is merged into GENIA corpus version 3.02 using a <w> tag that surrounds each token; the POS is given as the value of the "c" attribute.
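
As a rough illustration of the PTB-like layout described above, the following Python sketch groups token/POS lines into sentences. The sample text is invented for illustration and is not taken from the corpus, the function name parse_ptb_like is hypothetical, and splitting on the last "/" of each line is an assumption about how tokens that themselves contain "/" should be handled.

```python
SENTENCE_SEPARATOR = "=" * 10  # the "==========" line placed between sentences


def parse_ptb_like(text):
    """Return a list of sentences, each a list of (token, pos) tuples."""
    sentences, current = [], []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line == SENTENCE_SEPARATOR:
            if current:
                sentences.append(current)
                current = []
            continue
        # Assumption: the POS tag follows the LAST "/" on the line,
        # so a token that contains "/" itself is kept whole.
        token, sep, pos = line.rpartition("/")
        if not sep:
            raise ValueError(f"expected token/POS, got: {line!r}")
        current.append((token, pos))
    if current:  # tolerate a missing final separator
        sentences.append(current)
    return sentences


# Invented sample, not actual corpus content.
sample = (
    "IL-2/NN\ngene/NN\nexpression/NN\n==========\n"
    "Activation/NN\nrequires/VBZ\n==========\n"
)
sentences = parse_ptb_like(sample)
```

Each sentence comes back as a list of (token, POS) pairs, which is usually the most convenient shape for training or evaluating a tagger.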

In the merged format, but not in the PTB-like format, some tokens are assigned "*" as their POS. This occurs when a token is split by <term> tags assigned by the annotators of the original GENIA corpus. In such cases, the last fragment of the split token carries the POS tag assigned by the POS annotators, and the other fragments are assigned "*", e.g. <w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>.
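
One way to recover whole tokens from the merged format is to glue each "*" fragment onto the next fragment that carries a real POS tag. The Python sketch below does this for the example fragment above, using the standard-library xml.etree.ElementTree. The function name rejoin_split_tokens and the wrapping <sentence> element are assumptions made so that the fragment parses as a standalone document; the real files may need different handling.

```python
import xml.etree.ElementTree as ET


def rejoin_split_tokens(xml_fragment):
    """Return (token, pos) pairs, merging "*" fragments into the next token."""
    # Assumption: wrap the fragment in a single root element so it is
    # well-formed XML on its own.
    root = ET.fromstring(f"<sentence>{xml_fragment}</sentence>")
    tokens, pending = [], ""
    for w in root.iter("w"):  # <w> elements in document order
        text = w.text or ""
        pos = w.get("c")
        if pos == "*":
            pending += text  # non-final fragment of a split token
        else:
            tokens.append((pending + text, pos))
            pending = ""
    return tokens


fragment = """<w c="*">anti-</w><term sem="#003"><w c='JJ'>IgM</w></term>"""
print(rejoin_split_tokens(fragment))  # [('anti-IgM', 'JJ')]
```

The <term> boundaries are ignored here on purpose; a fuller reader would track them as well if the term annotation is needed alongside the POS tags.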

Documentation

Annotation guidelines

Tateisi, Yuka and Jun'ichi Tsujii. GENIA Annotation Guidelines for Tokenization and POS tagging. Technical Report (TR-NLP-UT-2006-4). Tsujii Laboratory, University of Tokyo, 2006.

Publications

Tateisi, Yuka and Jun'ichi Tsujii. Part-of-Speech Annotation of Biology Research Abstracts. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004). IV. Lisbon, Portugal, pp. 1267-1270, May 2004.

Download