JavaNLP meeting notes for 01/22/03

This was the first JavaNLP meeting of '03. We plan to have one next week
(Wednesday, 4pm, Gates 498) but in general this quarter will be lighter
because of all the paper deadlines. We're focusing for now on small targeted
improvements, and agreed on the following tasks to complete by next meeting:

EVERYONE:
- add @author tags to all the classes and packages for which you are
  responsible. This is to identify who to contact when someone has an
  issue/question with a file. An author tag looks like this:

    @author Joseph Smarr (jsmarr@stanford.edu)

  and you put it at the end of all class comments (at the top of the java
  file) and at the end of package comments. If you have more than one author
  for a file, just include multiple author lines (the convention is newer
  authors go below older authors). There are plenty of examples with author
  tags already (see for instance src/edu/stanford/nlp/ie/hmm/Extractor.java).

PLEASE TAKE A FEW MINUTES AND ADD AUTHOR TAGS TO ALL THE FILES FOR WHICH YOU
ARE RESPONSIBLE. Just write it out once then copy and paste it a bunch. It
will take no time and it will really help us manage the files. By next
Wednesday, every file and package should have at least one author tag!

---

Sep:
- Add 20 news groups to /u/nlp/data and create dbm classes for reading them
  in. We want to use this as a proof-of-concept for text categorization using
  the new document and classifier framework.

Joseph:
- Fix AbstractDataCollection so that it stores its data in itself (and not in
  a data field)
- Look at creating a DocumentCollection class for storing text documents, with
  some of the general-purpose functionality that's currently in Corpus (vocab,
  splits, etc)
- Consider creating a simple XML schema for documents and document collections
  that we can convert other file formats to

Dan:
- make oldparser use the new interface
- fix API for parse methods
- add a work limit param that returns the PCFG
- add word length range args (like -ms and -mx in java)
- make sure TaggedWords work
- modularize language-dependent features of parser
- fix other misc. files for I/O stuff (readers etc)

Kristina:
- generalize code for log linear model with text file input
- provide documentation for using log linear code

Roger:
- parser stuff

Huy:
- hmm experiments

Teg:
- remove all uses of gnu.regexp from JavaNLP, using java.util.regex instead
- look for potential uses of JavaNLP code in the CS276B project
- convert /u/nlp/data/iedata/acquisitions.txt to XML format (to be specified)

Cindy:
- fix FrameNet parser XML issues (consider using standard java XML packages)

Chris:
- integrate Tim's new PTB code into JavaNLP PTB Tokenizer

I will send out email on Monday asking for a progress report, which I will use
to plan how we spend time at the next meeting. The tasks above should only
take you a few hours per person, so please try to make the time. And whatever
you do, get the @author tags in! :)

Thanks,
js
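
P.S. In case a concrete picture helps, here is a rough sketch of what a class
comment with author tags might look like, on a small class that uses
java.util.regex rather than gnu.regexp. Everything in it is made up for
illustration: CapitalizedWordFinder is a hypothetical class (not part of
JavaNLP), the author names and addresses are placeholders, and the pattern is
just an example. It also assumes nothing about how gnu.regexp is currently
called in our code, so treat it as a starting point, not a migration recipe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Finds capitalized words in a string using java.util.regex.
 * Hypothetical example class for illustration only; not part of JavaNLP.
 *
 * @author Older Author (older@example.edu)
 * @author Newer Author (newer@example.edu)
 */
public class CapitalizedWordFinder {

  // Compile the pattern once and reuse it; Pattern instances are immutable.
  private static final Pattern CAPITALIZED =
      Pattern.compile("\\b[A-Z][a-z]+\\b");

  /** Returns every capitalized word in text, in order of appearance. */
  public static List<String> find(String text) {
    List<String> words = new ArrayList<String>();
    Matcher m = CAPITALIZED.matcher(text);
    // find() advances to the next match; group() returns the matched text.
    while (m.find()) {
      words.add(m.group());
    }
    return words;
  }

  public static void main(String[] args) {
    // prints [Stanford, Dan, Roger]
    System.out.println(find("Stanford parser by Dan and Roger"));
  }
}
```

The author lines sit at the end of the class comment, with the newer author
below the older one, as described above.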