Welcome to JavaNLP

This page is intended to be a guide for new members of Java NLP. As well as helpful how-to information, it contains some important information on policies and practice. Please email suggestions/corrections to David Hall.

Quickstart

For the impatient: use the Quickstart to set up with a barebones JavaNLP checkout. Be sure to read the rest of this document after you're finished, as it contains important rules and guidelines.

Introduction

What is Java NLP?

Java NLP is a subgroup of Stanford's NLP group. We share a repository of utility and NLP-related Java code, which we are constantly improving and expanding.

Most code falls into one of three categories. There is utility code, which is not NLP specific. The second category is usable NLP code. We have a state-of-the-art parser, part-of-speech tagger, and sequence classifier (for doing tasks such as named entity recognition) to name a few. The last category is works-in-progress. People store their research code in Java NLP so that they can easily access other code.

Sometimes, parts of the repository are released to the public, on the software page.

What are the benefits of being a member?

The primary benefit of being a member is that you have access to lots of other people's really useful code. Would you like to use a sentence's parse tree in your current project? A word's part-of-speech tag? A nice way of keeping/manipulating counts of things? Java NLP has these things and many more. Also, you know the authors, so getting explanations of how to make something work is easy. A secondary benefit is that you have more eyes looking for, and sometimes fixing, bugs in your code. Oh, and you get free pizza.

What are the drawbacks of being a member?

As a member you must attend regular meetings. As of summer 2008, we meet at 2:00PM on Thursdays in Building 460, Room 126 (Margaret Jacks Hall, Linguistics Conference Room). At these meetings we discuss ways to improve Java NLP, which usually consist of adding/removing/merging code, adding documentation and reorganizing code structure. We assign tasks at these meetings and you should try to complete your task by the next meeting.

What are the alternatives to membership?

Sometimes people working with the NLP group have just been users of the code, and that's okay for particular projects if you just want to use a tool. But even as a user you should make sure you are familiar with, and abide by, the contents of this document. Note: JavaNLP is typically not available to people not working with the Stanford NLP Group.

Getting Started

Source code management

We use Subversion to manage our source code. Please read the JavaNLP SCM guide.

IntelliJ

Because Java is such a defective language, some people find it useful to compensate by using an IDE such as IntelliJ. IntelliJ has a lot of useful features. It can do tasks like refactoring for you. Look at the IntelliJ guide for more advice on getting started.

ant

We use "ant" to compile our code. Ant is a more verbose, less functional version of make for Java. Its inane XML syntax is pretty much a direct attack on productivity. We use it because the people who set it up have graduated and no one wants to touch it now.

Classpaths

In order to get the repository to compile, your classpath needs to be set correctly. It must include the jars for all of the third-party resources that we use. Just add the line source /u/nlp/bin/setup.csh to your .cshrc file. Bash users should source setup.bash from their bash startup file instead.

Command line completion

William (a previous JavaNLP Czar) wrote a helper so that bash can auto-complete Java class names. Run ant complete from your JavaNLP directory for instructions.

Resources

JavaDocs

We have JavaDocs for the repository. When you update or add Javadoc comments in your code, you can update the JavaDocs on the web immediately. But the JavaDocs are rebuilt each night, so most people just wait 24 hours.

Repository Browser

View the SVN repository.

Our Machines

See the machine info page.

Useful JavaNLP Classes

An introduction to some of the more important classes in JavaNLP: JavaNLP classes

Rules

Repository Must Compile

The repository MUST compile. Make sure your code compiles before you commit it. If you edit/add more than one file, make sure you commit all of the files. If you make a change to code that other code uses, make sure to fix that code as well.

Repository Should Not Be Distributed

You should not give or distribute any part of the JavaNLP repository to third parties without first consulting with the JavaNLP membership. You should also make sure that checked out copies of JavaNLP code are appropriately protected so that not just anyone can read them. Where possible we try to make JavaNLP code available, and quite a bit of it is available on the Software page, but there are a number of dimensions to be respected, including grants with restrictive intellectual property clauses, unfinished and unpublished research in progress, and just the wishes of the primary authors of the code.

Rudimentary Javadoc should exist

It's good if the JavaNLP Javadoc is useful and usable. We're not always perfect about providing good documentation for everything (although it is certainly a good thing to aspire to). However, you should make sure that you at least do the following things:

  1. Put a package.html file in any new package saying what it is about and what the main classes are.
  2. Put a class Javadoc comment on each class saying what it does.
  3. Put an @author tag on each class saying who the main authors and updaters have been. In general, feel free to add your name as an author: it's most useful to interpret this field as "this person knows this class well", and so if you've rewritten parts of something or added a few methods, do add your name.
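
As an illustration of items 2 and 3, here is a minimal sketch of a class with a class-level Javadoc comment and an @author tag. The class WhitespaceSplitter and the author name are made up for this example and are not part of JavaNLP. The code also uses the 2-space, no-tabs indentation that the Tabs rule asks for.

```java
/**
 * Splits a line of text into whitespace-separated tokens.
 * Leading and trailing whitespace is ignored, and a run of whitespace
 * counts as a single separator.
 *
 * @author Jane Doe (hypothetical name; list the people who know this class well)
 */
class WhitespaceSplitter {

  /**
   * Returns the tokens in the given line, or an empty array if the line is blank.
   */
  static String[] split(String line) {
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return new String[0];
    }
    return trimmed.split("\\s+");
  }

  public static void main(String[] args) {
    // Prints one token per line.
    for (String token : split("  the quick   brown fox ")) {
      System.out.println(token);
    }
  }
}
```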

Tabs

Thou shalt use space characters instead of tabs for indentation. Thou shalt also use an indent width of 2 spaces. See JavaNLP coding conventions.

Log Messages

When you commit files, Subversion asks for a log message. Please write a human-understandable and specific log message. These help other people know what's happening, and will likely help you if you have to go back and figure something out about your code. They don't have to be long or fancy, or contain correct grammar; they just need to be a little bit informative.

svn diff

Be aware of which files you are changing, and how, when you are doing a commit. It's easy to accidentally commit code with half-done changes or added debugging that you did not intend to commit, but which you happened to write last week and which is still in your JavaNLP directory. It's a really good idea to do an svn status and svn diff before any svn commit so you know for sure what changes you are committing to the repository.

java-nlp-list

As either a member or a user, you must subscribe to the java-nlp-list mailing list (send a message body of subscribe java-nlp-list to majordomo@lists.stanford.edu). This is the only way you'll hear that you've broken code, that there are problems, or when meetings are.

Things You Should Know About

Counter

About half of what we do is counting words, and doing things to word counts. It will be worth your time to look over the Counter class (in the stats package). Counters are Maps from Objects to doubles, and are useful for, well, counting things. We have two kinds of Counters: ClassicCounter and OpenAddressCounter. Use OpenAddressCounter if you aren't sure. Each counter lets you increment and decrement counts. The Counters utility class (stats.Counters) allows you to view the average, min, argmin, max, and argmax. You can also retrieve Objects whose counts are above or below a threshold. The point is, it's useful.
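
To make the idea concrete, here is a minimal, self-contained sketch of a counter as a map from objects to doubles, with increment, decrement, and argmax. SimpleCounter and its method names are hypothetical and chosen for this example; this is not the JavaNLP Counter implementation, so check the JavaDocs for the real API.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative counter: a map from keys to double-valued counts.
 * Hypothetical class; this is not the JavaNLP Counter implementation.
 */
class SimpleCounter<E> {
  private final Map<E, Double> counts = new HashMap<>();

  /** Adds delta (negative to decrement) to the key's count and returns the new count. */
  double incrementCount(E key, double delta) {
    return counts.merge(key, delta, Double::sum);
  }

  /** Returns the key's count, or 0.0 if the key has never been seen. */
  double getCount(E key) {
    return counts.getOrDefault(key, 0.0);
  }

  /** Returns a key with the largest count, or null if the counter is empty. */
  E argmax() {
    E best = null;
    double bestCount = Double.NEGATIVE_INFINITY;
    for (Map.Entry<E, Double> entry : counts.entrySet()) {
      if (entry.getValue() > bestCount) {
        best = entry.getKey();
        bestCount = entry.getValue();
      }
    }
    return best;
  }

  public static void main(String[] args) {
    SimpleCounter<String> wordCounts = new SimpleCounter<>();
    for (String word : "the cat saw the dog".split(" ")) {
      wordCounts.incrementCount(word, 1.0);
    }
    wordCounts.incrementCount("dog", -1.0); // decrement
    // prints "the 2.0"
    System.out.println(wordCounts.argmax() + " " + wordCounts.getCount("the"));
  }
}
```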

ObjectBank

Because code is frequently written for one input format, and then subsequently used on other input formats, code can get very messy. In an attempt to deal with this we developed the ObjectBank class (in the objectbank package), which is designed to simplify adding allowable input formats. See the JavaDocs for more details/examples.
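
The general idea of separating input-format parsing from the code that consumes the parsed objects can be sketched as follows. This is only an illustration of the problem being solved: LineObjectReader, readAll, and both input formats are made up for this example, and none of it is the ObjectBank API (see the JavaDocs for that).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical helper, not the ObjectBank API: reads lines from any Reader
 * and turns each one into an object using a caller-supplied parser.
 */
class LineObjectReader {

  /** Parses every non-blank line of the input with the given parser. */
  static <T> List<T> readAll(Reader input, Function<String, T> parser) {
    List<T> items = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(input)) {
      String line;
      while ((line = reader.readLine()) != null) {
        if (!line.trim().isEmpty()) {
          items.add(parser.apply(line));
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return items;
  }

  /** Downstream code: works on parsed pairs and never sees the input format. */
  static void printPairs(List<String[]> pairs) {
    for (String[] pair : pairs) {
      System.out.println(pair[0] + " -> " + pair[1]);
    }
  }

  public static void main(String[] args) {
    // Two made-up formats holding the same word/tag data.
    List<String[]> tabbed =
        readAll(new StringReader("dog\tNN\nruns\tVBZ\n"), line -> line.split("\t"));
    List<String[]> slashed =
        readAll(new StringReader("dog/NN\nruns/VBZ\n"), line -> line.split("/"));
    printPairs(tabbed);
    printPairs(slashed);
  }
}
```

Supporting a new input format in this sketch means writing one more parser function; the consuming code stays unchanged.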

What is dev?

"dev" is a sister repository to Java NLP where people can store code that they are working on, but aren't yet ready to add to Java NLP. Every member who wants one can have their own subdirectory in dev in which to keep whatever they wish. Code in dev can, and usually will, make use of code in Java NLP. While Java NLP is always expected to compile, this is not the case with dev. But be careful - if someone changes code in Java NLP which makes code of yours which is in Java NLP not compile, they will also go into your code and update so that it compiles again. They will do no such thing in your dev directory, meaning its possible for you to write something in dev which compiles. step out for a bathroom break, come back and have it no longer compile. So you should be motivated to get things from dev into Java NLP as quickly as possible.

Useful non-JavaNLP Tips

Viewing/Killing Jobs

There are many ways to view and kill jobs. The easiest is probably top, which you can type at the command line to see the top jobs running on the machine, who's running them, how long they've been running, how much of the CPU and memory they're taking up, etc. You can type 'u' and then a username to get that person's jobs. Also, each job will have a number next to it; to kill a job, type 'k' and then the number. And don't worry, you can't kill other people's jobs.

Leaving Programs Running After You Logoff

If you want a process to not be terminated when you log off, nohup it: nohup java myJavaProgram &

Which machine should I run my job on?

See William's guide to working with multiple machines.

More Questions?

Don't be shy! If you have a question, the easiest way to get an answer is probably to ask at a JavaNLP meeting or to write to the java-nlp-list mailing list, where multiple people can help. Or, if you have other questions, you can try David, the Java NLP czar, or Chris [manning@cs.stanford.edu], our beloved advisor.