Marti Hearst
University of California, Berkeley


Towards Semi-Supervised Algorithms for Semantic Relation Detection in BioScience Text



Abstract

A crucial step toward the goal of automatic extraction of propositional information from natural language text is the identification of semantic relations between constituents in sentences. In the bioscience text domain, we have developed a simple ontology-based algorithm for determining which semantic relation holds between terms in noun compounds, and a supervised learning algorithm for discovering relations between entities. In this talk, I will first briefly describe these results. A major bottleneck for semantic labeling work is the development of labeled training data. To address this bottleneck, we propose a new approach for creating semantically labeled data that makes use of what we call *citances*: the text of the sentences surrounding citations to research articles. Citances provide us with differently worded statements of approximately the same semantic information; by looking at the way that different authors talk about the same facts, we obtain paraphrases nearly for free. We have just begun to assess how well citances work for the creation of labeled training data for the problem of detecting protein-protein interaction relations. We also hypothesize that citances will be useful for synonym creation, document summarization, and database curation.

Joint work with Preslav Nakov, Barbara Rosario, Ariel Schwartz, and Janice Hamer. This work is part of the BioText project, supported by NSF DBI-0317510.