INTRODUCTION

This release contains all the source code necessary to replicate the results of our EMNLP 2012 paper on multi-instance multi-label learning for relation extraction (see References). Additionally, this package contains all the source code for our KBP slot-filling system (but not all the data necessary to replicate our KBP 2011 results - see the Usage section for details).

Please note that this is research code: it is slow, not optimized, not very clean, and not well documented. Additionally, it contains several other experiments we tried for the TAC-KBP shared task, which are probably not relevant for most people. Nevertheless, the models we proposed in the EMNLP paper, which is what I think most people will care about, are implemented relatively cleanly and are isolated in the following classes:

The MIML-RE model:
  edu.stanford.nlp.kbp.slotfilling.classify.JointBayesRelationExtractor

The Mintz++ model:
  the same class as the above, but instantiated with the parameter
  onlyLocal = true (see the sketch below)

Our implementation of the Hoffmann model:
  edu.stanford.nlp.kbp.slotfilling.classify.HoffmannExtractor

The classifier we used at KBP 2011:
  edu.stanford.nlp.kbp.slotfilling.classify.OneVsAllRelationExtractor
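If you want to instantiate these models programmatically rather than through the entry points listed below, the following sketch illustrates the idea. This is only an illustration: the Properties-based constructor signatures shown here are assumptions (only the onlyLocal flag is documented above), so please check JointBayesRelationExtractor for the actual API.

  import java.io.FileReader;
  import java.util.Properties;
  import edu.stanford.nlp.kbp.slotfilling.classify.JointBayesRelationExtractor;

  public class ModelSketch {
    public static void main(String[] args) throws Exception {
      // Load one of the provided configuration files (path is just an example)
      Properties props = new Properties();
      props.load(new FileReader("config/multir/multir_mimlre.properties"));

      // MIML-RE: the full joint model
      JointBayesRelationExtractor mimlre =
        new JointBayesRelationExtractor(props);

      // Mintz++: the same class, restricted to the local classifiers
      // (assumed two-argument constructor; the flag is the documented
      // onlyLocal = true setting)
      JointBayesRelationExtractor mintzpp =
        new JointBayesRelationExtractor(props, true);
    }
  }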
The main entry points for the experiments are:

For the experiments with the Riedel dataset:
  edu.stanford.nlp.kbp.slotfilling.MultiR

For the KBP system:
  edu.stanford.nlp.kbp.slotfilling.KBPTrainer

See the Usage section for the actual command lines.

AUTHORS

Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang, Valentin I. Spitkovsky, and Christopher D. Manning

REFERENCES

If you are interested in the model we published at EMNLP 2012, please cite this paper:

  Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, and Christopher D.
  Manning. Multi-instance Multi-label Learning for Relation Extraction.
  Proceedings of the 2012 Joint Conference on Empirical Methods in Natural
  Language Processing and Computational Natural Language Learning
  (EMNLP-CoNLL), 2012.

If you are interested in our KBP system, please cite this paper:

  Mihai Surdeanu, Sonal Gupta, John Bauer, David McClosky, Angel X. Chang,
  Valentin I. Spitkovsky, and Christopher D. Manning. Stanford's
  Distantly-Supervised Slot-Filling System. Proceedings of the TAC-KBP 2011
  Workshop, 2011.

ACKNOWLEDGMENTS

We gratefully thank Raphael Hoffmann and Sebastian Riedel for sharing their data and code and for the many helpful discussions. This release includes the data generated by Sebastian Riedel and re-packaged by Raphael Hoffmann (available only under certain conditions - ask mihais AT stanford DOT edu for details). One of our models (edu.stanford.nlp.kbp.slotfilling.classify.HoffmannExtractor) replicates Raphael Hoffmann's best model from his ACL 2011 paper.

We thank the organizers of the TAC-KBP shared tasks for all their effort.

We thank SRI (and in particular Lynn Voss) for being very responsive to Mihai's annoying questions and for publicly releasing their gazetteer (faust-gazetteer).

LICENSING

Copyright (c) 2009-2012 The Board of Trustees of The Leland Stanford Junior University. All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program (see the file LICENSE.txt); if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

For the licenses of the jar files included in the lib/ directory, please see the file LIBRARY_LICENSES.txt.

USAGE

If you are interested in the experiments on the Riedel dataset (the top part of Figure 4 in the EMNLP 2012 paper), please note that this corpus is not included in this distribution for licensing reasons. Please contact Mihai Surdeanu (mihais AT stanford DOT edu) for details.

This release includes all relevant source code in the src/ directory and the corresponding Java classes in the classes/ directory. The sources were compiled with Java 1.6. If, for any reason, you decide to recompile, just type "ant all".

To replicate the experiments in our EMNLP paper, please follow the instructions below.

------------------------------------------------------------------------------

For the experiments with the Riedel dataset (the top part of Figure 4 in the paper), use the following commands:

To generate the Mintz++ curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mintz.properties

To generate the MIML-RE curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mimlre.properties

To generate the MIML-RE At-Least-One curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.MultiR -props \
    config/multir/multir_mimlre_atleastone.properties

The runtime for these models ranges from approximately 30 minutes for the Mintz++ model to approximately 6 hours for the MIML-RE models.

At the end of each run, the system prints the precision, recall, and F1 scores for the last point in the P/R curve, as well as the location of the file with the data for the entire P/R curve. For example, the last two lines in the output of the MIML-RE run are:

  P 0.2803560076287349 R 0.22615384615384615 F1 0.2503548112404201
  P/R curve values saved in file corpora/multir/multir_JOINT_BAYES_T5_E15_NF5_Fall_M1_Istable_Ytrue.curve

In the .curve file, the precision values are in column 3, the recall values are in column 5, and the corresponding F1 scores are in column 7.
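If you want to post-process a .curve file, for example to plot the P/R curve, the following sketch extracts these columns. The CurveReader class is not part of this release; it only illustrates the column layout described above.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  public class CurveReader {
    // Prints "recall <tab> precision <tab> F1" for every point in a .curve
    // file, assuming whitespace-separated columns (precision = column 3,
    // recall = column 5, F1 = column 7; columns are 1-indexed).
    public static void main(String[] args) throws IOException {
      BufferedReader in = new BufferedReader(new FileReader(args[0]));
      for (String line; (line = in.readLine()) != null; ) {
        String[] cols = line.trim().split("\\s+");
        if (cols.length < 7) continue; // skip empty or malformed lines
        System.out.println(cols[4] + "\t" + cols[2] + "\t" + cols[6]);
      }
      in.close();
    }
  }

Compile this class anywhere on your classpath and run it with the path to a .curve file as its only argument.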
Additionally, each run saves the learned models (after each epoch, where applicable) in the corpora/multir directory, in files with the same prefix as the .curve file but with the .ser extension. Because of this, any subsequent run of the above commands will be much faster (a couple of minutes).

------------------------------------------------------------------------------

For the experiments with the KBP dataset (the bottom part of Figure 4 in the EMNLP 2012 paper), use the following commands:

To generate the Hoffmann curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_hoffmann.properties

To generate the Mintz++ curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mintz.properties

To generate the MIML-RE curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mimlre.properties

To generate the MIML-RE At-Least-One curve:
  ./run.sh edu.stanford.nlp.kbp.slotfilling.KBPTrainer -props \
    config/kbp/kbp_mimlre_atleastone.properties

The runtime for these models ranges from approximately 3 hours for the Mintz++ model and our implementation of the Hoffmann model to approximately 20 hours for the MIML-RE model.

At the end of each run, the system prints the KBP score for the last point in the P/R curve, as well as the location of the file with the data for the entire P/R curve. For example, the last lines in the output of the MIML-RE run are:

  2010 scores:
  Recall: 177 / 576 = 0.30729166
  Precision: 177 / 728 = 0.24313186
  F1: 0.2714724
  Jun 9, 2012 4:59:44 PM edu.stanford.nlp.kbp.slotfilling.common.Log severe
  SEVERE: P/R curve data generated in file: corpora/kbp/mimlre.curve

The format of the .curve file is the same as above. As with the MultiR runs, the KBP runs save the learned models (after each epoch, where applicable) in the corpora/kbp directory, in files with the prefix given by the serializedRelationExtractorPath property and with the .ser extension. Because of this, any subsequent run of the above commands will be much faster (less than 10 minutes).

Note that these runs generate a few additional files, with names starting with the prefix given by the value of the kbp.runid property. These files serve only debugging purposes and can be safely removed.

------------------------------------------------------------------------------

This package is insufficient to replicate our KBP 2011 results: the release includes our entire KBP source code, but not all the data necessary for the experiments. For example, for our EMNLP 2012 experiments we fetched a maximum of 50 sentences per entity from Wikipedia and the official KBP corpus, while for the KBP experiments we fetched up to 500 sentences per entity and, additionally, used data from web snippets. To repeat these experiments, you would need access to all our indices, which are very large (hundreds of GB). If you are seriously interested in this project, please contact me (mihais AT stanford DOT edu) directly, and we will arrange the transfer of data.