INTRODUCTION

This release contains all the source code necessary to replicate the results of our EMNLP 2012 paper on multi-instance multi-label learning for relation extraction (see References). Additionally, this package contains all the source code for our KBP slot-filling system (but not all the data necessary to replicate our KBP 2011 results - see the Usage section for details).

Please note that this is research code: it is slow, not optimized, not very clean, and not well documented. Additionally, it contains several other experiments we tried for the TAC-KBP shared task, which are probably not relevant for most people. Nevertheless, the models we proposed in the EMNLP paper, which is what I think most people will care about, are implemented relatively cleanly and are isolated in the following classes:

The MIML-RE model:
  edu.stanford.nlp.kbp.slotfilling.classify.JointBayesRelationExtractor

The Mintz++ model:
  the same class as the above, but instantiated with the parameter
  onlyLocal = true (see the sketch below)

Our implementation of the Hoffmann model:
  edu.stanford.nlp.kbp.slotfilling.classify.HoffmannExtractor

The classifier we used at KBP 2011:
  edu.stanford.nlp.kbp.slotfilling.classify.OneVsAllRelationExtractor
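If you want to instantiate these models programmatically rather than through the entry points listed below, the following sketch illustrates the idea. This is only an illustration: the Properties-based constructor signatures shown here are assumptions (only the onlyLocal flag is documented above), so please check JointBayesRelationExtractor for the actual API.

  import java.io.FileReader;
  import java.util.Properties;
  import edu.stanford.nlp.kbp.slotfilling.classify.JointBayesRelationExtractor;

  public class ModelSketch {
    public static void main(String[] args) throws Exception {
      // Load one of the provided configuration files (path is just an example)
      Properties props = new Properties();
      props.load(new FileReader("config/multir/multir_mimlre.properties"));

      // MIML-RE: the full joint model
      JointBayesRelationExtractor mimlre =
        new JointBayesRelationExtractor(props);

      // Mintz++: the same class, restricted to the local classifiers
      // (assumed two-argument constructor; the flag is the documented
      // onlyLocal = true setting)
      JointBayesRelationExtractor mintzpp =
        new JointBayesRelationExtractor(props, true);
    }
  }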
The main entry points for the experiments are:

For the experiments with the Riedel dataset:
  edu.stanford.nlp.kbp.slotfilling.MultiR

For the KBP system:
  edu.stanford.nlp.kbp.slotfilling.KBPTrainer

See the Usage section for the actual command lines.

AUTHORS

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang, Valentin I. Spitkovsky, and Christopher D. Manning

REFERENCES

If you are interested in the model we published at EMNLP 2012, please cite this paper:

  Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D.
  Manning. Multi-instance Multi-label Learning for Relation Extraction.
  Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
  Language Processing and Computational Natural Language Learning
  (EMNLP-CoNLL), 2012.

If you are interested in our KBP system, please cite this paper:

  Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang,
  Valentin I. Spitkovsky, and Christopher D. Manning. Stanford's
  Distantly-Supervised Slot-Filling System. Proceedings of the TAC-KBP 2011
  Workshop, 2011.

ACKNOWLEDGMENTS

We gratefully thank Raphael Hoffmann and Sebastian Riedel for sharing their data and code and for the many helpful discussions. This release includes the data generated by Sebastian Riedel and re-packaged by Raphael Hoffmann (available only under certain conditions - ask mihais AT stanford DOT edu for details). One of our models (edu.stanford.nlp.kbp.slotfilling.classify.HoffmannExtractor) replicates Raphael Hoffmann's best model from his ACL 2011 paper.

We thank the organizers of the TAC-KBP shared tasks for all their effort.

We thank SRI (and in particular Lynn Voss) for being very responsive to Mihai's annoying questions and for publicly releasing their gazetteer (faust-gazetteer).

LICENSING

Copyright (c) 2009-2012 The Board of Trustees of The Leland Stanford Junior University. All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program (see the file LICENSE.txt); if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

For the licenses of the jar files included in the lib/ directory, please see the file LIBRARY_LICENSES.txt.

USAGE

If you are interested in the experiments on the Riedel dataset (the top part of Figure 4 in the EMNLP 2012 paper), please note that this corpus is not included in this distribution for licensing reasons. Please contact Mihai Surdeanu (mihais AT stanford DOT edu) for details.

This release includes all relevant source code in the src/ directory and the corresponding Java classes in the classes/ directory. The sources were compiled with Java 1.6. If, for any reason, you decide to recompile, just type "ant all".

To replicate the experiments in our EMNLP paper, please follow the instructions below.

------------------------------------------------------------------------------

For the experiments with the Riedel dataset (the top part of Figure 4 in the paper), use the following commands:

To generate the Mintz++ curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mintz.properties

To generate the MIML-RE curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mimlre.properties

To generate the MIML-RE At-Least-One curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mimlre_atleastone.properties

The runtime for these models ranges from approximately 30 minutes for the Mintz++ model to approximately 6 hours for the MIML-RE models.

At the end of each run, the system prints the precision, recall, and F1 scores for the last point in the P/R curve, as well as the location of the file with the data for the entire P/R curve. For example, the last two lines in the output of the MIML-RE run are:

  P 0.2803560076287349 R 0.22615384615384615 F1 0.2503548112404201
  P/R curve values saved in file corpora/multir/multir_JOINT_BAYES_T5_E15_NF5_Fall_M1_Istable_Ytrue.curve

In the .curve file, the precision values are in column 3, the recall values are in column 5, and the corresponding F1 scores are in column 7.
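If you want to post-process a .curve file, for example to plot the P/R curve, the following sketch extracts these columns. The CurveReader class is not part of this release; it only illustrates the column layout described above.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  public class CurveReader {
    // Prints "recall <tab> precision <tab> F1" for every point in a .curve
    // file, assuming whitespace-separated columns (precision = column 3,
    // recall = column 5, F1 = column 7; columns are 1-indexed).
    public static void main(String[] args) throws IOException {
      BufferedReader in = new BufferedReader(new FileReader(args[0]));
      for (String line; (line = in.readLine()) != null; ) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length < 7) continue; // skip empty or malformed lines
        System.out.println(cols[4] + "\t" + cols[2] + "\t" + cols[6]);
      }
      in.close();
    }
  }

Compile this class anywhere on your classpath and run it with the path to a .curve file as its only argument.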
Additionally, each run saves the learned models (after each epoch, where applicable) in the corpora/multir directory, in files with the same prefix as the .curve file but with the .ser extension. Because of this, any subsequent run of the above commands will be much faster (a couple of minutes).

------------------------------------------------------------------------------

For the experiments with the KBP dataset (the bottom part of Figure 4 in the EMNLP 2012 paper), use the following commands:

To generate the Hoffmann curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_hoffmann.properties

To generate the Mintz++ curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mintz.properties

To generate the MIML-RE curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mimlre.properties

To generate the MIML-RE At-Least-One curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mimlre_atleastone.properties

The runtime for these models ranges from approximately 3 hours for the Mintz++ model and our implementation of the Hoffmann model to approximately 20 hours for the MIML-RE model.

At the end of each run, the system prints the KBP score for the last point in the P/R curve, as well as the location of the file with the data for the entire P/R curve. For example, the last lines in the output of the MIML-RE run are:

  2010 scores:
  Recall: 177 / 576 = 0.30729166
  Precision: 177 / 728 = 0.24313186
  F1: 0.2714724
  Jun 9, 2012 4:59:44 PM edu.stanford.nlp.kbp.slotfilling.common.Log severe
  SEVERE: P/R curve data generated in file: corpora/kbp/mimlre.curve

The format of the .curve file is the same as above. As with the MultiR runs, the KBP runs save the learned models (after each epoch, where applicable) in the corpora/kbp directory, in files with the prefix given by the serializedRelationExtractorPath property and with the .ser extension. Because of this, any subsequent run of the above commands will be much faster (less than 10 minutes).

Note that these runs generate a few additional files, with names starting with the prefix given by the value of the kbp.runid property. These files serve only debugging purposes and can be safely removed.

------------------------------------------------------------------------------

This package is insufficient to replicate our KBP 2011 results: the release includes our entire KBP source code, but not all the data necessary for the experiments. For example, for our EMNLP 2012 experiments we fetched a maximum of 50 sentences per entity from Wikipedia and the official KBP corpus, while for the KBP experiments we fetched up to 500 sentences per entity and, additionally, used data from web snippets. To repeat these experiments, you would need access to all our indices, which are very large (hundreds of GB). If you are seriously interested in this project, please contact me (mihais AT stanford DOT edu) directly, and we will arrange the transfer of data.