Stanford Topic Modeling Toolbox

Version 0.2.1

Upgrading from 0.1.x? Make sure to update your scripts to match the examples on this page. Some small changes are not backwards compatible with previous versions.

The Stanford Topic Modeling Toolbox (TMT) brings topic modeling tools to social scientists and others who wish to perform analysis on datasets that have a substantial textual component. The toolbox provides the ability to import and prepare text from CSV files, train topic models such as LDA and Labeled LDA, and analyze and export their outputs.

The Stanford Topic Modeling Toolbox was written at the Stanford NLP Group by Daniel Ramage and Evan Rosen, and was first released in September 2009.

Programmers: See the API Reference and the source code.

Getting started

This section contains software installation instructions and an overview of the basic mechanics of running the toolbox.


Prerequisites

  • A text editor for creating TMT processing scripts.
    TMT scripts are written in Scala, but no knowledge of Scala is required to get started.
  • An installation of Java SE 6 or greater: download.
    Windows, Mac, and Linux are supported.

Installation

Example data files and scripts

These two downloads should go into the same folder so that the script can find the data.

example-0-test.scala

// TMT Example 0 - Basic data loading

import scalanlp.stage.source._;

val pubmed = CSVFile("pubmed-oa-subset.csv");

println("Success: " + pubmed + " contains " + pubmed.data.size + " records");
  

This is a simple script that loads the records contained in the sample data file pubmed-oa-subset.csv, a subset of the Open Access database of publications in PubMed.

Running the toolbox

Now run the toolbox as before and select "Open script ..." from the file menu. Navigate to example-0-test.scala and click Open, then Run.

Alternatively, you can run the script from the command line:

  java -jar tmt-0.2.1.jar example-0-test.scala

If all goes well you should see the following lines of output:

  Success: CSVFile("pubmed-oa-subset.csv") contains 1550 records

You're all set to continue with the tutorial. For the rest of the tutorial, invoke the toolbox in the same way as we do above but with a different script name.


Preparing a dataset

The first step in using the Topic Modeling Toolbox on a data file (CSV or TSV, e.g. as exported by Excel) is to tell the toolbox where to find the text in the file. This section describes how the toolbox converts a column of text from a file into a sequence of words.


The code for this example is in example-1-dataset.scala.

The process of extracting and preparing text from a CSV file can be thought of as a pipeline, where a raw CSV file goes through a series of stages that ultimately result in something that can be used to train the topic model. Here is a sample pipeline for the pubmed-oa-subset.csv data file:

// input file to read
val pubmed = CSVFile("pubmed-oa-subset.csv");

// the text field extracted and processed from the file
val text = {
    pubmed ~>                            // read from the pubmed file
    Column(3) ~>                         // select column three, the abstracts
    CaseFolder ~>                        // lowercase everything
    SimpleEnglishTokenizer ~>            // tokenize on whitespace
    WordsAndNumbersOnlyFilter ~>         // ignore non-words and non-numbers
    TermCounter ~>                       // collect counts (needed below)
    TermMinimumLengthFilter(3) ~>        // take terms with >=3 characters
    TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
    TermDynamicStopListFilter(30) ~>     // filter out 30 most common terms
    DocumentMinimumLengthFilter(5)       // take only docs with >=5 terms
}

The input data file (in the pubmed variable) is a pointer to the CSV file you downloaded earlier. pubmed is passed through a series of stages that each transform, filter, or otherwise interact with the data.

Selecting data from a CSV file

The first step is to select fields from your CSV file that contain the text you would like to use for training the model.

If your text data is only in one column:

  CSVFile("your-csv-file.csv") ~> Column(3)

The code above will load the text from column three in the CSV file.

If your text is in more than one column:

  CSVFile("your-csv-file.csv") ~> Columns(2,3) ~> Join(" ")

The code above selects columns two and three, then concatenates their contents with a space character as glue.
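Conceptually, Columns(2,3) ~> Join(" ") selects two fields of each row and glues them together with the given separator. The following plain-Scala sketch (not the TMT API, and using a made-up row) illustrates that behavior, assuming 1-based column indices as in the examples on this page:

```scala
// Hypothetical CSV row: column 1 = year, column 2 = title, column 3 = abstract
val row = Vector("2008", "Some Title", "We analyze gene expression in mice.")

// Sketch of Columns(2,3) ~> Join(" "): pick the named columns (1-indexed)
// and concatenate their contents with the glue string.
def selectAndJoin(row: Vector[String], cols: Seq[Int], glue: String): String =
  cols.map(i => row(i - 1)).mkString(glue)

val text = selectAndJoin(row, Seq(2, 3), " ")
```

Here `selectAndJoin` is a hypothetical helper written for illustration only; in TMT the equivalent work is done by the pipeline stages themselves.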

Tokenizing

The next set of manipulations involves breaking up the text into its component words, a process known as tokenization. This is accomplished with:

  ... ~> CaseFolder ~> SimpleEnglishTokenizer ~> WordsAndNumbersOnlyFilter ~> ...

The CaseFolder is applied first to make "The", "tHE", and "THE" all look like "the"; that is, the case folder reduces the number of distinct word types seen by the model by turning all characters to lowercase.

Next, the text is tokenized using the SimpleEnglishTokenizer, which removes punctuation from the ends of words and then splits up the input text by whitespace characters (tabs, spaces, carriage returns, etc.). You could alternatively use the WhitespaceTokenizer if your text fields have already been processed into cleaned tokens.

Words that are entirely punctuation and other non-word non-number characters are removed from the generated lists of tokenized documents by using the WordsAndNumbersOnlyFilter.
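These three stages can be roughly approximated in plain Scala. The sketch below is only a stand-in for the real stages (the actual SimpleEnglishTokenizer is more careful about punctuation), but it shows the shape of the transformation:

```scala
// Rough plain-Scala approximation of
// CaseFolder ~> SimpleEnglishTokenizer ~> WordsAndNumbersOnlyFilter.
def tokenize(text: String): List[String] =
  text.toLowerCase                                       // CaseFolder: lowercase everything
    .split("\\s+")                                       // split on whitespace
    .map(_.replaceAll("^\\p{Punct}+|\\p{Punct}+$", ""))  // strip edge punctuation
    .filter(_.matches("[a-z0-9]+"))                      // keep only words and numbers
    .toList

val tokens = tokenize("The gene, THE genes -- 42 ??")
```

Note that the duplicate "the" tokens survive this step; duplicates only matter later, when counts are collected.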

Finding meaningful words

LDA can be useful for extracting patterns of meaningful word use, but it is not good at determining which words are meaningful. Very common words like 'the' often do not indicate the kind of similarity between documents in which one is interested. Single letters and other very short tokens are also rarely useful for understanding content. To lead LDA towards extracting patterns among meaningful words, we have implemented a collection of standard heuristics:

  ... ~> TermCounter ~>
         TermMinimumLengthFilter(3) ~>
         TermMinimumDocumentCountFilter(4) ~>
         TermDynamicStopListFilter(30) ~> ...

The code above removes terms shorter than three characters (e.g., words like "is"), terms that appear in fewer than four documents (very rare words tell us little about the similarity of documents), and the 30 most common terms in the corpus (ubiquitous words also tell us little about document similarity; such words are conventionally called "stop words"). These values may need to be adjusted if you are working with a much larger or much smaller corpus than a few thousand documents.

The TermCounter stage first computes some statistics needed by the later stages. These statistics are stored in the metadata associated with each parcel, which enables any downstream stage to access them.
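The counting and filtering stages can be sketched in plain Scala on a toy corpus, with the thresholds scaled down from the values used above (the real stages operate on TMT's parcel metadata rather than on plain lists):

```scala
// Toy corpus: each document is a list of tokens.
val docs = List(
  List("the", "gene", "expression", "is", "high"),
  List("the", "gene", "mutation", "is", "rare"),
  List("the", "protein", "expression", "is", "low"),
  List("the", "gene", "expression", "the", "protein"))

// Like TermCounter: document frequency of each term
// (in how many documents does it appear?).
val docFreq: Map[String, Int] =
  docs.flatMap(_.distinct).groupBy(identity).view.mapValues(_.size).toMap

val minLength = 3  // like TermMinimumLengthFilter(3)
val minDocs   = 2  // like TermMinimumDocumentCountFilter(4), scaled down
val stopK     = 1  // like TermDynamicStopListFilter(30), scaled down

// The stopK most common terms by document frequency become stop words.
val stopList: Set[String] =
  docFreq.toList.sortBy(-_._2).take(stopK).map(_._1).toSet

def keep(t: String): Boolean =
  t.length >= minLength && docFreq(t) >= minDocs && !stopList.contains(t)

val filtered = docs.map(_.filter(keep))
```

On this toy corpus "the" becomes the single dynamic stop word, "is" falls below the length threshold, and the words appearing in only one document are dropped.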

Removing Empty Documents

Some documents in your dataset may be missing or empty (especially now that some words have been filtered out by the preceding stages). We can disregard these documents during training by applying DocumentMinimumLengthFilter(length), which removes all documents shorter than the specified length.

Putting it all together

Run example 1 (example-1-dataset.scala). This program will first load the data pipeline and then print out information about the loaded text dataset, including a signature of the dataset (the "parcel") and the list of 30 stop words found for this corpus. [Note that in PubMed, "gene" is filtered out because it is so common!]


Learning a topic model

Once you've prepared a dataset to learn against, you're all set to train a topic model. This example shows how to train an instance of Latent Dirichlet Allocation using the dataset you prepared above.


The code for this example is in example-2-lda-learn.scala

Load the data file

// input file to read
val pubmed = CSVFile("pubmed-oa-subset.csv");

// the text field extracted and processed from the file
val text = {
    pubmed ~>                            // read from the pubmed file
    Column(3) ~>                         // select column three, the abstracts
    CaseFolder ~>                        // lowercase everything
    SimpleEnglishTokenizer ~>            // tokenize on whitespace
    WordsAndNumbersOnlyFilter ~>         // ignore non-words and non-numbers
    TermCounter ~>                       // collect counts (needed below)
    TermMinimumLengthFilter(3) ~>        // take terms with >=3 characters
    TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
    TermDynamicStopListFilter(30) ~>     // filter out 30 most common terms
    DocumentMinimumLengthFilter(5)       // take only docs with >=5 terms
}
This code snippet is the same as in the previous tutorial. It extracts and prepares the text from column 3.

Select parameters for training an LDA model

// turn the text into a dataset ready to be used with LDA
val dataset = LDADataset(text);
 
// define the model parameters
val numTopics = 30;
val modelParams = LDA.ModelParams(numTopics);
   // this is equivalent to:
   //
   // val modelParams = LDA.ModelParams(numTopics,
   //                                   LDA.TermSmoothing(.01),
   //                                   LDA.TopicSmoothing(50.0 / numTopics));
  
// define the training parameters
val trainingParams = GibbsLDATrainer.DefaultTrainingParams;
    // this is equivalent to:
    //
    // import GibbsLDA.LearningModel._;
    // val trainingParams =
    //   TrainingParams(MaxIterations(1500),
    //                  SaveEvery(50, LogProbabilityEstimate,
    //                                DocumentTopicDistributions,
    //                                DocumentTopicAssignments));
    //
    // SaveEvery(...) could be replaced by SaveFinal() to write less output
    //
    // val trainingParams = TrainingParams(MaxIterations(1500), SaveFinal());

Train the model to fit the documents

// Name of the output model folder to generate
val output = file("lda-"+dataset.signature+"-"+modelParams.signature);

// Trains the model: the model (and intermediate models) are written to the
// output folder.  If a partially trained model with the same dataset and
// parameters exists in that folder, training will be resumed.
TrainGibbsLDA(output, dataset, modelParams, trainingParams); // renamed 0.2

// new in 0.2 - load the per-word topic assignments saved during training
//   (averages across last 10 saved models)
val perDocWordTopicProbability =
  LoadTrainingPerWordTopicDistributions(output, dataset, 10);
 
// new in 0.2 - write per-document topic usage to file 
DocumentTopicUsage(perDocWordTopicProbability) | CSVFile(output, "usage.csv");

The model will output status messages as it trains; this will take a few minutes. The last lines generate a file in the output folder that contains the per-document topic distribution of each training document. This file has the same number of rows as the original input (with empty rows for documents that were skipped during training), so it can be aligned with the original input CSV with a simple copy and paste.

A tour of the generated output folder

The generated model output folder, in this case lda-f7a35bfa-30-2b517070-7c1f94d2, contains everything needed to analyze the learning process and to load the model back in from disk.

  • dataset.txt: the history of stages used to produce the text used for training.
  • model-params.txt: the model parameters specified during training.
  • training-params.txt: the training parameters used to determine convergence.
  • 00000 - 01500: snapshots of the model during training, saved every 50 iterations.

Determining if the model has converged

A simple way to see whether the training procedure has converged is to look at the values of log-probability-estimate.txt in the numbered folders. This file contains an informal estimate of the probability the model assigns to the data during training. The numbers tend to trace a curve that tapers off but never stops changing completely. If they don't look like they have stabilized, you might want to retrain with a higher number of iterations. If you re-run the script with a higher number of iterations (or if you continue training a model that was interrupted during training), the toolbox will resume training from the highest iteration stored on disk.
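One simple, ad hoc way to check stabilization is to compare the last two saved estimates. The numbers below are made up for illustration, and the threshold is an arbitrary choice:

```scala
// Hypothetical log-probability estimates from successive saved snapshots.
val logProbs = List(-1.95e6, -1.80e6, -1.74e6, -1.720e6, -1.715e6, -1.714e6)

// Relative change between the last two snapshots: a small value suggests
// the estimates have (roughly) stabilized.
def relativeChange(xs: List[Double]): Double = {
  val a = xs(xs.size - 2)
  val b = xs.last
  math.abs(b - a) / math.abs(a)
}

val converged = relativeChange(logProbs) < 1e-3
```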


Selecting model parameters

This tutorial describes how to select model parameters such as the number of topics by a (computationally intensive) tuning procedure, which searches for the parameters that minimize the model's perplexity on held-out data.


The code for this example is in example-4-lda-select.scala

The script splits the documents into two subsets: one used for training models, the other used for evaluating their perplexity on unseen data. Perplexity is scored on the evaluation documents by first splitting each document in half. The per-document topic distribution is estimated on the first half of the words. The toolbox then computes how surprised the model was, on average, by the words in the second half of the document, where surprise is measured as the effective number of equiprobable word choices. The value is written into each trained model's output folder as perplexity.txt; lower numbers indicate a surer model.
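The perplexity computation itself reduces to exponentiating the negative mean log-probability of the held-out words. A sketch with hypothetical per-word probabilities:

```scala
// Hypothetical probabilities the model assigned to the words in the
// held-out (second) half of one evaluation document.
val wordProbs = List(0.01, 0.05, 0.002, 0.02)

// Perplexity = exp(-mean log p(w)): the average number of equally likely
// word choices the model was effectively facing.
val perplexity = math.exp(-wordProbs.map(math.log).sum / wordProbs.size)
```

A model that assigned every held-out word probability 1/100 would score a perplexity of exactly 100, which is why lower numbers mean a surer model.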

Perplexity scores are not comparable across corpora because they are affected by differences in vocabulary size, but they can be used to compare models trained on the same data (as in the example script). Be aware, however, that models with better perplexity scores do not always produce more interpretable topics, or topics better suited to a particular task. For lack of a better option, perplexity can serve as a stable measure for picking among alternatives.

Some non-parametric topic models can automatically select the number of topics as part of the model training procedure itself. However, these models (such as the Hierarchical Dirichlet Process) are not yet implemented in the toolbox. Even in such models, some parameters remain to be tuned, such as the topic smoothing and term smoothing parameters.

Generating topic model outputs

This tutorial shows how to generate basic outputs by querying the topic model for information about topic and word usage in various subsets of the data.


The code for this example is in example-3-lda-infer.scala

Load the text as before

// input file to read
val pubmed = CSVFile("pubmed-oa-subset.csv");

// the text field extracted and processed from the file
val text = {
    pubmed ~>                            // read from the pubmed file
    Column(3) ~>                         // select column three, the abstracts
    CaseFolder ~>                        // lowercase everything
    SimpleEnglishTokenizer ~>            // tokenize on whitespace
    WordsAndNumbersOnlyFilter ~>         // ignore non-words and non-numbers
    TermCounter ~>                       // collect counts (needed below)
    TermMinimumLengthFilter(3) ~>        // take terms with >=3 characters
    TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
    TermDynamicStopListFilter(30) ~>     // filter out 30 most common terms
    DocumentMinimumLengthFilter(5)       // take only docs with >=5 terms
}

// turn the text into a dataset ready to be used with LDA
val dataset = LDADataset(text);

Re-load the trained model

// the path to the model we want to load
val modelPath = file("lda-f7a35bfa-30-2b517070-7c1f94d2");

// load the trained model
val model = LoadGibbsLDA(modelPath);

Infer per-document per-word distributions over latent topics

// infer topic distributions for each word in each document in the dataset.
System.err.println("Running inference ... (this could take several minutes)");
val perDocWordTopicProbability = InferPerWordTopicDistributions(model, dataset);

Generating global topic statistics

Our implementation currently supports three primary queries on the product of inference.

  • Top-k words per topic
  • Usage of individual words in each topic
  • Usage of topics overall in the corpus

The example snippet saves these outputs into CSV files that can be imported back into Excel.

//
// now build an object to query the inferred outputs
//

System.err.println("Generating general outputs ...");

// build an object to query the model
val fullLDAQuery = LDAQuery(perDocWordTopicProbability);

// write the top 20 words per topic to a csv file
fullLDAQuery.topK(20) | CSVFile("pubmed-topk.csv");

// track some words' usage
fullLDAQuery.trackWords("gene","probability") | CSVFile("pubmed-words.csv");

// write the overall topic usage
fullLDAQuery.usage | CSVFile("pubmed-usage.csv");

The statistic generated is measured in (fractional) documents, i.e., how many documents' words were assigned to the given topic.
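As a sketch of how fractional document counts arise: each document contributes to each topic in proportion to the topic mass of its words, so one document's contributions sum to one. The per-word distributions below are hypothetical:

```scala
// Hypothetical per-document, per-word topic distributions for 2 topics:
// perDoc(d)(w) is the probability vector over topics for word w of document d.
val perDoc: List[List[Array[Double]]] = List(
  List(Array(0.9, 0.1), Array(0.8, 0.2)),                    // doc 0: mostly topic 0
  List(Array(0.1, 0.9), Array(0.3, 0.7), Array(0.2, 0.8)))   // doc 1: mostly topic 1

// Usage in fractional documents: average each document's per-word topic
// mass, then sum across documents.
def usage(docs: List[List[Array[Double]]], numTopics: Int): Array[Double] = {
  val totals = new Array[Double](numTopics)
  for (doc <- docs; k <- 0 until numTopics)
    totals(k) += doc.map(_(k)).sum / doc.size
  totals
}

val topicUsage = usage(perDoc, 2)
```

With two documents, the two usage values sum to 2.0: each document contributes exactly one (fractional) document spread across the topics.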

Slicing the LDA query output

Often, more data is associated with each document than just its words. In our example, we also have a year associated with each document. We can query by "slices" of the dataset, which subdivide the counts as generated above by metadata listed in the grouping information of each document.

//
// now build an object to query by a field
//

System.err.println("Generating sliced outputs ...");

// define fields from the dataset we are going to slice against
val year = pubmed ~> Column(1);   // select column 1, the year
             
// create a slice object by binding the output of inference with the fields
val sliceLDAQuery = SlicedLDAQuery(perDocWordTopicProbability, year);

sliceLDAQuery.topK(20) | CSVFile("pubmed-slice-topk.csv");
sliceLDAQuery.trackWords("gene","probability") | CSVFile("pubmed-slice-trackwords.csv");
sliceLDAQuery.usage | CSVFile("pubmed-slice-usage.csv");
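Conceptually, slicing groups the per-document statistics by a metadata value before aggregating. A plain-Scala sketch with hypothetical per-document topic usage and years:

```scala
// Hypothetical per-document topic usage (2 topics), tagged with each
// document's year from the sliced column.
val docUsage = List(
  ("1999", Array(0.8, 0.2)),
  ("1999", Array(0.6, 0.4)),
  ("2000", Array(0.1, 0.9)))

// Slicing: aggregate topic usage within each group of the metadata field,
// yielding one row of counts per (year, topic) pair.
val sliced: Map[String, Array[Double]] =
  docUsage.groupBy(_._1).view.mapValues { group =>
    val totals = new Array[Double](2)
    for ((_, u) <- group; k <- 0 until 2) totals(k) += u(k)
    totals
  }.toMap
```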

The next tutorial shows you what you can do with these outputs in more detail.


Analyzing topic model outputs in Excel

The CSV files generated in the previous tutorial can be imported directly into Excel, which provides an advanced platform for understanding, plotting, and manipulating the topic model outputs. If things don't seem to make sense, you might need to try different model parameters.

Note that the screenshots below were based on the output generated in version 0.1.2. As of 0.2, each generated output file has a header line and possibly a few more informative fields.

Training a Labeled LDA model

Labeled LDA is a supervised topic model for credit attribution in multi-labeled corpora [pdf, bib]. If one of the columns in your input text file contains labels or tags that apply to the document, you can use Labeled LDA to discover which parts of each document go with each label, and to learn accurate models of the words best associated with each label globally.


The code for this example is in example-5-labeled-lda-learn.scala

This example is very similar to the example on training a regular LDA model, except for a few small changes. One is that instead of specifying LDA.ModelParams, we specify LabeledLDA.ModelParams, which doesn't include an option for a number of topics because the topics are assumed to be in one-to-one alignment with the label set (although this assumption is conceptually easy to relax).

To specify a LabeledLDA dataset, we need to tell the toolbox where the text comes from as well as where the labels come from.

// the text field extracted and processed from the file
val text = {
    pubmed ~>                            // read from the pubmed file
    Column(3) ~>                         // select column three, the abstracts
    CaseFolder ~>                        // lowercase everything
    SimpleEnglishTokenizer ~>            // tokenize on whitespace
    WordsAndNumbersOnlyFilter ~>         // ignore non-words and non-numbers
    TermCounter ~>                       // collect counts (needed below)
    TermMinimumLengthFilter(3) ~>        // take terms with >=3 characters
    TermMinimumDocumentCountFilter(4) ~> // filter terms in <4 docs
    TermDynamicStopListFilter(30) ~>     // filter out 30 most common terms
    DocumentMinimumLengthFilter(5)       // take only docs with >=5 terms
}

// define fields from the dataset we are going to slice against
val year = {
    pubmed ~>                            // read from the pubmed file
    Column(1) ~>                         // take column one, the year
    WhitespaceTokenizer                  // turns label field into an array
}

val dataset = LabeledLDADataset(text, year);

Labeled LDA assumes that each document can use only topics that are named in its label set. Here each document participates in only one label (its year). Years are not particularly interesting labels (versus, say, a field containing multiple tags describing each paper), but they suffice for this example. This example will eventually be updated to use a dataset of tagged web pages, but the version above demonstrates the main point for now.

Training a GibbsLabeledLDA model is similar to training a GibbsLDA model.

// Trains the model: the model (and intermediate models) are written to the
// output folder.  If a partially trained model with the same dataset and
// parameters exists in that folder, training will be resumed.
TrainGibbsLabeledLDA(modelPath, dataset, modelParams, trainingParams);

During training, the set of labels on each document is assumed fixed and observed, so the model will not use labels that are not present on the given document. However, after training is complete, you may be interested to see how much each label would be used if all labels were allowed to participate to some extent. To do so, we re-load the trained LabeledLDA model as just a regular LDA model and do inference as before, but using the labeled topics.

// Does inference on the same dataset, this time ignoring the assigned labels
// and letting the model decide which labels to apply
val model = GibbsLDA.loadInferenceModel(modelPath);
val perDocWordTopicProbability = InferPerWordTopicDistributions(model, LDADataset(text));
DocumentTopicUsage(perDocWordTopicProbability) | CSVFile(modelPath, "usage-after-inference.csv");

You can use perDocWordTopicProbability.data.options if you want to directly inspect how each word is assigned to each underlying label.