JavaNLP Classes

samples.AJavaNLPIntroduction.java is an executable example of how to use many of JavaNLP's common building blocks. This page covers the same classes, with more description of what JavaNLP offers and when you might want to use each piece.

Counters

JavaNLP provides a few classes for the most important task in NLP: counting stuff. Yes, pretty much everything you write will end up using one of these. The three main classes you should be aware of are:

Here's a simple example of counting words and calculating the entropy of the resulting word distribution:

  Counter<String> wordCounts = new OpenAddressCounter<String>();
  for (String word: words) {
    wordCounts.incrementCount(word);
  }
  double entropy = Counters.entropy(wordCounts);

Counters, in particular, is worth taking a little time to wander through. For example, the code above could have been written even more simply using Counters.asCounter, and in addition to entropy you can also calculate the max, argmax, mean, L2Norm, klDivergence, etc.
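
To make the counting-then-entropy step concrete, here is a minimal JDK-only sketch of the same computation. It is not the JavaNLP implementation: the class name EntropySketch and its methods are hypothetical, and the choice of log base 2 (entropy in bits) is an assumption, since this page does not say which base Counters.entropy uses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical JDK-only sketch; not part of JavaNLP.
final class EntropySketch {

  // Count how often each word occurs.
  static Map<String, Integer> countWords(List<String> words) {
    Map<String, Integer> counts = new HashMap<>();
    for (String word : words) {
      counts.merge(word, 1, Integer::sum);
    }
    return counts;
  }

  // Shannon entropy of the normalized counts, in bits (base 2 is an assumption).
  static double entropy(Map<String, Integer> counts) {
    double total = 0.0;
    for (int count : counts.values()) {
      total += count;
    }
    double entropy = 0.0;
    for (int count : counts.values()) {
      if (count == 0) {
        continue; // a zero count contributes nothing
      }
      double p = count / total;
      entropy -= p * (Math.log(p) / Math.log(2));
    }
    return entropy;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("a", "b", "a", "c");
    System.out.println(entropy(countWords(words)));
  }
}
```

The Counter classes save you from writing the merge and normalization loops by hand, which is most of what this sketch consists of.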

Maps with Complex Values

You will often want maps whose values are other complex objects, like lists, maps and counters. For these purposes, you will want to use util.DefaultValuedMap.

For example, if you wanted to count the distribution of words at each possible word length, you could write code like:

  Map<Integer, Counter<String>> lengthWordCounts = DefaultValuedMap.counterValuedMap();
  for (String word: words) {
    lengthWordCounts.get(word.length()).incrementCount(word);
  }

DefaultValuedMap works well for two-, three-, or n-dimensional counters, collection-valued maps, combinations of these, and much more. Basically, any time you're using a map and you'd like default values to be filled in automatically, you should be using one.
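
For comparison, the behavior a default-valued map gives you can be approximated in plain Java with Map.computeIfAbsent. The sketch below is JDK-only and is not how DefaultValuedMap is implemented; the class name LengthWordCountsSketch is hypothetical, and a nested Map<String, Integer> stands in for Counter<String>.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical JDK-only sketch; a nested map stands in for Counter<String>.
final class LengthWordCountsSketch {

  // Group word counts by word length, creating inner maps on first use.
  static Map<Integer, Map<String, Integer>> countByLength(List<String> words) {
    Map<Integer, Map<String, Integer>> byLength = new HashMap<>();
    for (String word : words) {
      byLength
          .computeIfAbsent(word.length(), length -> new HashMap<>())
          .merge(word, 1, Integer::sum);
    }
    return byLength;
  }

  public static void main(String[] args) {
    List<String> words = Arrays.asList("to", "be", "or", "not", "to", "be");
    System.out.println(countByLength(words));
  }
}
```

The difference is that here every call site has to remember computeIfAbsent, whereas a default-valued map fills in the missing value on a plain get.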

Note also that you can easily get arbitrarily nested maps by using the asDefault() method:

  Map<String, DefaultValuedMap<String, DefaultValuedMap<String, List<String>>>> map =
    DefaultValuedMap.<String, String>arrayListValuedMap()
    .<String>asDefault()
    .asDefault();
  map.get("a").get("b").get("c").add("d");

Reading from and Writing to Files

The io.IOUtils class provides a number of convenience methods for dealing with input and output. Some of the more noteworthy ones are illustrated below.

Reading the lines of a file

IOUtils.readLines(String) produces an Iterable over the lines of the file. This makes working through a file line by line much easier than the usual BufferedReader approach:

  for (String line: IOUtils.readLines("some_file.txt")) {
    // do something
  }
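
For reference, the "usual BufferedReader approach" mentioned above looks roughly like the following JDK-only sketch. The class name ReadLinesSketch is hypothetical, and reading the file as UTF-8 is an assumption; this page does not say which encoding IOUtils.readLines uses.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical JDK-only sketch of reading a file line by line.
final class ReadLinesSketch {

  // Read every line of the file into a list (UTF-8 is an assumption).
  static List<String> readLines(String path) throws IOException {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader =
        Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }

  public static void main(String[] args) throws IOException {
    // Print the lines of the file named on the command line, if any.
    if (args.length == 1) {
      for (String line : readLines(args[0])) {
        System.out.println(line);
      }
    }
  }
}
```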

Reading and writing objects

Serializing and deserializing objects in Java usually requires a bit of work putting together the various input and output stream wrappers, casting objects, etc. This boilerplate is taken care of for you by IOUtils.writeObjectToFile and IOUtils.readObjectFromFile:

  // create the object
  Map<Integer, Counter<String>> counts = DefaultValuedMap.counterValuedMap();
  ...
    
  // save the object to disk
  IOUtils.writeObjectToFile(counts, "counts.ser");
    
  // load the object from disk
  counts = IOUtils.readObjectFromFile("counts.ser");
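
The boilerplate these two helpers replace looks roughly like the JDK-only sketch below. It is not the IOUtils implementation; the class name SerializationSketch and its method names are hypothetical, and a plain HashMap stands in for the counter-valued map.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical JDK-only sketch of hand-written object serialization.
final class SerializationSketch {

  // Wrap the file in buffered and object streams, then write the object.
  static void writeObject(Serializable object, String path) throws IOException {
    try (ObjectOutputStream out = new ObjectOutputStream(
        new BufferedOutputStream(new FileOutputStream(path)))) {
      out.writeObject(object);
    }
  }

  // Read the object back; the unchecked cast is the caller's responsibility.
  @SuppressWarnings("unchecked")
  static <T> T readObject(String path) throws IOException, ClassNotFoundException {
    try (ObjectInputStream in = new ObjectInputStream(
        new BufferedInputStream(new FileInputStream(path)))) {
      return (T) in.readObject();
    }
  }

  public static void main(String[] args) throws Exception {
    File file = File.createTempFile("counts", ".ser");
    HashMap<String, Integer> counts = new HashMap<>();
    counts.put("the", 3);
    writeObject(counts, file.getPath());
    Map<String, Integer> loaded = readObject(file.getPath());
    System.out.println(loaded);
    file.delete();
  }
}
```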

Command Line Arguments

JavaNLP uses the ResearchAssistant library for handling command line arguments. If you're writing a class with a main method, you'll probably want to be using this. Typical usage involves adding annotations to static variables to be filled in, and then calling Arguments.parse at the beginning of the main method:

public class MyClass {

  // declare a command line argument for the input file
  @Argument.Switch({"-i", "--input"})
  @Argument("file to calculate the counts from")
  @Argument.Policy(ArgumentPolicy.REQUIRED)
  public static String INPUT_FILE = null;

  // declare a command line argument for the output file
  @Argument.Switch({"-o", "--output"})
  @Argument("file to write the counts to")
  public static String OUTPUT_FILE = "output.ser"; 
  
  public static void main(String[] args) throws Exception {
    Arguments.parse(args, MyClass.class);
    ...
  }
}

There's more information about parsing command line arguments at the ResearchAssistant site.
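
To illustrate the general idea of annotation-driven argument parsing, here is a small JDK-only sketch that fills annotated static String fields using reflection. It does not use or reproduce the ResearchAssistant API: the @Flag annotation, the FlagParsingSketch class, and the parse method are all hypothetical names invented for this example, and only String-valued "switch value" pairs are handled.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.Arrays;

// Hypothetical JDK-only sketch; not the ResearchAssistant library.
final class FlagParsingSketch {

  // Hypothetical annotation marking a static String field as an argument.
  @Retention(RetentionPolicy.RUNTIME)
  @Target(ElementType.FIELD)
  @interface Flag {
    String[] names();
    boolean required() default false;
  }

  // Example holder class with one required and one optional argument.
  static class Options {
    @Flag(names = {"-i", "--input"}, required = true)
    public static String INPUT_FILE = null;

    @Flag(names = {"-o", "--output"})
    public static String OUTPUT_FILE = "output.ser";
  }

  // Set each annotated static field from a matching "switch value" pair.
  static void parse(String[] args, Class<?> target) {
    for (Field field : target.getDeclaredFields()) {
      Flag flag = field.getAnnotation(Flag.class);
      if (flag == null || !Modifier.isStatic(field.getModifiers())) {
        continue;
      }
      boolean found = false;
      for (int i = 0; i + 1 < args.length; i++) {
        if (Arrays.asList(flag.names()).contains(args[i])) {
          try {
            field.setAccessible(true);
            field.set(null, args[i + 1]);
          } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
          }
          found = true;
        }
      }
      if (flag.required() && !found) {
        throw new IllegalArgumentException(
            "missing required argument " + flag.names()[0]);
      }
    }
  }

  public static void main(String[] args) {
    parse(new String[] {"--input", "corpus.txt"}, Options.class);
    System.out.println(Options.INPUT_FILE + " -> " + Options.OUTPUT_FILE);
  }
}
```

A real argument library adds type conversion, usage messages, and validation on top of this; the sketch only shows why annotating static fields and making one parse call at the top of main is enough to wire things up.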

Additional Information

You may also want to glance over some older slides on JavaNLP classes.