The TAC Relation Extraction Dataset

A large-scale relation extraction dataset with 106k+ examples over 41 TAC KBP relation types (plus a no_relation class).

Introduction

TACRED is a large-scale relation extraction dataset with 106,264 examples built over newswire and web text from the corpus used in the yearly TAC Knowledge Base Population (TAC KBP) challenges. Examples in TACRED cover 41 relation types used in the TAC KBP challenges (e.g., per:schools_attended and org:members), or are labeled no_relation if none of the defined relations holds. These examples were created by combining available human annotations from the TAC KBP challenges with crowdsourcing.

Please see our EMNLP paper or our EMNLP slides for full details.

Note: A label-corrected version of the TACRED dataset is now available, and you should consider using it instead of the original version released in 2017. For more details on this new version, see the TACRED Revisited paper published at ACL 2020.

TACRED was created by sampling sentences from the TAC KBP newswire and web forum corpus in which a subject-object mention pair occurs.

In each TACRED example, the following annotations are provided:

the spans of the subject and object mentions;
the types of the mentions (among 23 fine-grained types used in the Stanford NER system);
the relation that holds between the entities (one of the 41 TAC KBP canonical relation types), or the no_relation label if no relation was found.

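To make these annotations concrete, below is a minimal sketch of a single example, written as a Python dict in the style of the JSON format shipped with our PyTorch code. The sentence, id, and values are invented for illustration; the field names follow that code's data loader rather than anything shown on this page.

# A minimal, illustrative TACRED-style example (not an actual record from
# the release). Mention spans are inclusive token indices.
example = {
    "id": "example-0001",                  # hypothetical id
    "relation": "per:title",
    "token": ["Mr.", "Smith", ",", "the", "company", "'s", "chairman", ",",
              "resigned", "."],
    "subj_start": 0, "subj_end": 1,        # subject mention: "Mr. Smith"
    "obj_start": 6, "obj_end": 6,          # object mention: "chairman"
    "subj_type": "PERSON",
    "obj_type": "TITLE",
}

# Recover the mention strings from the spans.
subj = " ".join(example["token"][example["subj_start"]:example["subj_end"] + 1])
obj = " ".join(example["token"][example["obj_start"]:example["obj_end"] + 1])
print(subj, "--", example["relation"], "-->", obj)  # Mr. Smith -- per:title --> chairman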

Data Splits

To minimize dataset bias, we stratify TACRED across the years in which the TAC KBP challenge was run:

Split   Number of Examples   Original Corpus
Train   68,124               TAC KBP 2009–2012
Dev     22,631               TAC KBP 2013
Test    15,509               TAC KBP 2014

Relation Distribution

To ensure that models trained on TACRED are not biased towards predicting false positives on real-world text, we annotated all sampled sentences in which no relation was found between the mention pair as negative examples. As a result, 79.5% of the examples are labeled no_relation, and the remaining 20.5% are distributed across the 41 relation types.

[Figure: distribution of relation types among the positive examples]
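
A quick way to check these statistics yourself is to tally the gold labels of one split. The sketch below assumes the JSON format illustrated above; the file path is an assumption and should point at your local copy of the data.

import json
from collections import Counter

# Tally gold relation labels in the training split (path is an assumption).
with open("dataset/tacred/train.json") as f:
    examples = json.load(f)

labels = Counter(ex["relation"] for ex in examples)
print(f"no_relation fraction: {labels['no_relation'] / len(examples):.1%}")

# Ten most frequent positive (non-no_relation) labels.
positive = Counter({r: c for r, c in labels.items() if r != "no_relation"})
for rel, count in positive.most_common(10):
    print(f"{rel:35s} {count}")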

Sentence Length Distribution

Compared to previous relation extraction datasets, TACRED contains longer sentences, with an average length of 36.4 tokens, reflecting the complexity of the contexts in which relations occur in real-world text.
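
This statistic can be reproduced from the same JSON files by counting tokens per example. As above, the path is an assumption.

import json

# Average sentence length in tokens over the training split
# (path is an assumption; adjust to your local copy).
with open("dataset/tacred/train.json") as f:
    examples = json.load(f)

avg_len = sum(len(ex["token"]) for ex in examples) / len(examples)
print(f"average sentence length: {avg_len:.1f} tokens")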

TACRED was created with the aim of advancing research in relation extraction and knowledge base population. At Stanford, we have therefore been using TACRED to (1) benchmark relation extraction models, and (2) train our knowledge base population systems.

Benchmarking Relation Extraction Models

We found that carefully designed neural models, when trained on TACRED, can easily outperform pattern-based systems and traditional models that rely on manual features.

Model                              P     R     F1
Traditional
  Patterns                         86.9  23.2  36.6
  Logistic Regression (LR)         73.5  49.9  59.4
  LR + Patterns                    72.9  51.8  60.5
Neural
  CNN                              75.6  47.5  58.3
  LSTM                             65.7  59.9  62.7
  LSTM + Position-aware attention  65.7  64.5  65.1
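
The P/R/F1 numbers above are micro-averaged, with no_relation treated as the negative class, so a model gets no credit for predicting no_relation. Below is a sketch of scoring under that convention; it mirrors the evaluation described in our paper but is not the official scorer.

def micro_prf(gold, pred, negative="no_relation"):
    """Micro-averaged precision/recall/F1, treating `negative` as the null class."""
    guessed = sum(1 for p in pred if p != negative)
    actual = sum(1 for g in gold if g != negative)
    correct = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    precision = correct / guessed if guessed else 0.0
    recall = correct / actual if actual else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy usage with four predictions:
gold = ["per:title", "no_relation", "org:members", "no_relation"]
pred = ["per:title", "org:members", "no_relation", "no_relation"]
print(micro_prf(gold, pred))  # (0.5, 0.5, 0.5)
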
Training Knowledge Base Population Systems

We found that our new KBP system, trained with TACRED, beats the previous state of the art by a large margin.

System                                     Hop-0              Hop-0 + Hop-1
                                           P     R     F1     P     R     F1
2015 Winning System (LR + Patterns)        37.5  24.5  29.7   26.6  19.0  22.2
Our Neural System (trained with TACRED)    39.0  28.9  33.2   28.2  21.5  24.4
  + Patterns                               40.2  31.5  35.3   29.7  24.2  26.7

For details on the models and experiments, please check out our EMNLP paper.

To respect the copyright of the underlying TAC KBP corpus, TACRED is released via the Linguistic Data Consortium (LDC), and you can download it from the LDC TACRED webpage. If you are an LDC member, access is free; otherwise, a $25 access fee applies.

Note: In addition to the original version of TACRED, you should also consider using the new label-corrected version of the dataset, which corrects a substantial portion of the dev/test labels in the original release. For more details, see the TACRED Revisited paper and its code base.

If you are affiliated with Stanford University, you most likely qualify for free access to the data. Please contact the Stanford Corpus TA for details on how to access it.

To get started with TACRED or to run the baseline position-aware attention model, you can use our PyTorch code.

Please cite the following paper if you use TACRED in your research:

@inproceedings{zhang2017tacred,
  author = {Zhang, Yuhao and Zhong, Victor and Chen, Danqi and Angeli, Gabor and Manning, Christopher D.},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP 2017)},
  title = {Position-aware Attention and Supervised Data Improve Slot Filling},
  url = {https://nlp.stanford.edu/pubs/zhang2017tacred.pdf},
  pages = {35--45},
  year = {2017}
}

If you have questions about using TACRED, please email us at:

yuhao ~at~ cs ~dot~ stanford ~dot~ edu