Named Entity Recognition (NER) and Information Extraction (IE)


Overview

We have worked on a wide range of NER and IE tasks over the past several years. We entered the CoNLL 2003 NER shared task using a character-based Maximum Entropy Markov Model (MEMM). In late 2003 we entered the BioCreative shared task, which addressed NER in biomedical papers. This task required identifying genes and proteins, but not distinguishing between the two. We used a model similar to the one from the CoNLL shared task, but tuned to the domain and with some additional features; we had the best performing system. Then, in 2004, we entered the BioNLP shared task at COLING, which also looked at biomedical papers but required identifying five different classes: DNA, RNA, cell line, cell type, and protein. We once again used an MEMM, but added much richer features, including features from parse trees, the web, and how entities were labeled elsewhere in a previous run.

We also entered the PASCAL IE shared task, which involved extracting information from workshop announcements. We attempted to use a relational model in addition to the MEMM to allow the use of top-down information. We have also studied the use of Gibbs sampling for inference in a Conditional Random Field (CRF), so as to incorporate longer-distance information. There has also been work on adapting sequence classifiers to new, unseen domains.
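To make the Gibbs sampling idea mentioned above concrete, here is a minimal toy sketch in Python. It is not the system described in the papers below. It assumes a sequence model with local (emission and transition) log-scores plus one hypothetical non-local term: a penalty whenever two occurrences of the same token receive different labels. The names `total_score`, `gibbs_decode`, the label set, and all numbers are illustrative assumptions, and the sketch simply keeps the best-scoring sample it visits rather than using any particular annealing schedule.

```python
import math
import random

LABELS = ["O", "PER", "ORG"]  # hypothetical label set for the toy example


def total_score(tokens, labels, emit, trans, penalty):
    """Log-score of a full labeling: local terms minus a non-local penalty."""
    score = sum(emit[i][y] for i, y in enumerate(labels))
    score += sum(trans.get((a, b), 0.0) for a, b in zip(labels, labels[1:]))
    # Non-local term (assumed): identical tokens should share a label.
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if tokens[i] == tokens[j] and labels[i] != labels[j]:
                score -= penalty
    return score


def gibbs_decode(tokens, emit, trans, penalty, sweeps=200, seed=0):
    """Resample one label at a time given all others; return the best labeling seen."""
    rng = random.Random(seed)
    # Start from the labeling that is best under the emission scores alone.
    labels = [max(LABELS, key=lambda y: emit[i][y]) for i in range(len(tokens))]
    best = list(labels)
    best_score = total_score(tokens, labels, emit, trans, penalty)
    for _ in range(sweeps):
        for i in range(len(tokens)):
            # Score every candidate label at position i with the rest held fixed.
            scores = []
            for y in LABELS:
                labels[i] = y
                scores.append(total_score(tokens, labels, emit, trans, penalty))
            top = max(scores)
            weights = [math.exp(s - top) for s in scores]  # unnormalized conditional
            k = rng.choices(range(len(LABELS)), weights=weights)[0]
            labels[i] = LABELS[k]
            if scores[k] > best_score:
                best, best_score = list(labels), scores[k]
    return best


# Toy data: the second "Tanjug" is locally ambiguous, the first is clearly ORG.
tokens = ["Tanjug", "said", "Tanjug"]
emit = [
    {"O": 0.0, "PER": 0.0, "ORG": 3.0},
    {"O": 3.0, "PER": 0.0, "ORG": 0.0},
    {"O": 0.2, "PER": 0.5, "ORG": 0.0},
]
trans = {}  # no transition preferences in this toy example
decoded = gibbs_decode(tokens, emit, trans, penalty=2.0)
print(decoded)
```

In this toy setup the locally best labeling (ORG, O, PER) scores 4.5 once the penalty is applied, while the consistent labeling (ORG, O, ORG) scores 6.0, so the non-local term is what moves the second "Tanjug" to ORG. Exact dynamic-programming inference cannot use such a term directly because it couples arbitrarily distant positions, which is why a sampling approach is attractive here.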

System performance

Details of the performance of our CMM (conditional Markov model, i.e. MEMM) and CRF systems on the CoNLL 2002 and 2003 NER data are available.

Available software

You can download our CRF-based NER system.

Papers

Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. [pdf]
Shipra Dingare, Malvina Nissim, Jenny Finkel, Claire Grover, and Christopher D. Manning. 2004. A System For Identifying Named Entities in Biomedical Text: How Results From Two Evaluations Reflect on Both the System and the Evaluations. Comparative and Functional Genomics 6:77-85. [ps] [pdf]
Shipra Dingare, Jenny Finkel, Malvina Nissim, Christopher Manning, and Claire Grover. 2004. A System For Identifying Named Entities in Biomedical Text: How Results From Two Evaluations Reflect on Both the System and the Evaluations. In The 2004 BioLink meeting: Linking Literature, Information and Knowledge for Biology at ISMB 2004. [ps] [pdf]
Jenny Finkel, Shipra Dingare, Huy Nguyen, Malvina Nissim, Christopher Manning, and Gail Sinclair. 2004. Exploiting Context for Biomedical Entity Recognition: From Syntax to the Web. Joint Workshop on Natural Language Processing in Biomedicine and its Applications at Coling 2004. [ps] [pdf]
Jenny Finkel, Shipra Dingare, Christopher Manning, Malvina Nissim, Beatrice Alex, and Claire Grover. in press. Exploring the Boundaries: Gene and Protein Identification in Biomedical Text. Accepted for publication in BMC Bioinformatics. [ps] [pdf]
Shipra Dingare, Jenny Finkel, Christopher Manning, Malvina Nissim, and Beatrice Alex. 2004. Exploring the Boundaries: Gene and Protein Identification in Biomedical Text. Proceedings of the BioCreative Workshop, Granada. [ps] [pdf]