Named Entity Recognition (NER) and Information Extraction (IE)

Overview

We have worked on a wide range of NER and IE tasks over the past several years. We entered the 2003 CoNLL NER shared task, using a character-based Maximum Entropy Markov Model (MEMM). In late 2003 we entered the BioCreative shared task, which targeted NER in biomedical papers. This task required identifying genes and proteins, but not distinguishing between the two. We used a model similar to the one from the CoNLL shared task, tuned to the domain and with some additional features; we had the best-performing system.

In 2004 we entered the BioNLP shared task at CoLing, which also looked at biomedical papers but required identifying five different classes: DNA, RNA, cell line, cell type, and protein. We once again used an MEMM, but added much richer features, including features from parse trees, the web, and how entities were labeled elsewhere on a previous run. We also entered the PASCAL IE shared task, which involved extracting information from workshop announcements. There we attempted to use a relational model in addition to the MEMM, to allow the use of top-down information.

We have also studied the use of Gibbs sampling for inference in a Conditional Random Field (CRF), so as to incorporate longer-distance information. There has also been work on adapting sequence classifiers to new, unseen domains.

System performance

Details of our CMM system's performance on CoNLL 2002 and 2003 NER data are available.

Available software

You can download our CRF-based NER system.

Papers
Site design by Bill MacCartney