JavaNLP meeting notes for 05/07/03

Today we talked about general prospects for the code currently in JavaNLP
and Dan gave a tutorial on the new classify API. We meet again next week
(5/14) at our usual time (4-5pm, Gates 498).

JavaNLP stats (excluding pubcrawl):
- 806 Java files in edu.stanford.nlp
- 173374 lines of Java code total
- Package with most classes: trees (96)
- Package with fewest classes: ie.ner (1)
- Package with most lines of code: redwoods (26135)
- Package with fewest lines of code: ie.ui (218)
- Longest Java file: lexparser.lexparser.FactoredParser (7800)
- Shortest Java file: util.XScored (6)

Notes on various packages:
- cluster: rewrite (Sep's oldest code / broken)
- database: just use Maps?
- fsm: move with oldparser
- lexgram: move with oldparser
- linalg: data structures to abstract over matrix package (e.g. colt)
- mt (does alignment, need decoder for full mt system)
- tagger: need generic tagging API
- wsd: "mess" of code for senseval, no real WSD code

Highlights of classify tutorial:
- code is in classify package and uses Datum/DataCollection from dbm
- use ClassifierFactory's trainClassifier method to make a Classifier
- use Classifier's classOf(Datum) to perform classification
- use Classifier's scoresOf(Datum) to get a Counter from classes to scores
- subclass AbstractLinearClassifierFactory if you can for new classifiers

Classify todo:
- Dan: make Classifier and ClassifierFactory serializable
- Dan: prior for Naïve Bayes classifier
- Dan/Joseph: use DataCollection code in classify code where appropriate
- Kristina: implement DecisionTree with splitting on feature present/absent
- Kristina: implement Perceptron over fixed feature vocab (1/0 or freq)
- Joseph/Dan: ClassifierTester package for acc/prec/recall/etc

See you next week!

Thanks,
js
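
For reference, the calls from the classify tutorial fit together roughly as
in the sketch below. This is a minimal illustration, not the JavaNLP code:
the Datum, Classifier and ClassifierFactory interfaces here are simplified
stand-ins declared just for this example, a plain Map stands in for Counter,
and BasicDatum and CountingClassifierFactory are hypothetical names for a toy
classifier that scores a class by feature co-occurrence counts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-ins for illustration only; NOT the actual JavaNLP types.
interface Datum {
    Collection<String> features();
    String label();
}

interface Classifier {
    String classOf(Datum d);
    // A plain Map stands in for a Counter from classes to scores.
    Map<String, Double> scoresOf(Datum d);
}

interface ClassifierFactory {
    Classifier trainClassifier(List<Datum> data);
}

// Hypothetical minimal Datum: a bag of string features plus a label.
final class BasicDatum implements Datum {
    private final List<String> features;
    private final String label;

    BasicDatum(List<String> features, String label) {
        this.features = new ArrayList<>(features);
        this.label = label;
    }

    public Collection<String> features() { return features; }
    public String label() { return label; }
}

// Hypothetical toy factory: a class's score for a datum is the number of
// times the datum's features were seen with that class in training.
final class CountingClassifierFactory implements ClassifierFactory {
    public Classifier trainClassifier(List<Datum> data) {
        // label -> feature -> count
        final Map<String, Map<String, Double>> counts = new HashMap<>();
        for (Datum d : data) {
            Map<String, Double> forLabel =
                counts.computeIfAbsent(d.label(), k -> new HashMap<>());
            for (String f : d.features()) {
                forLabel.merge(f, 1.0, Double::sum);
            }
        }
        return new Classifier() {
            public Map<String, Double> scoresOf(Datum d) {
                Map<String, Double> scores = new HashMap<>();
                for (Map.Entry<String, Map<String, Double>> e : counts.entrySet()) {
                    double s = 0.0;
                    for (String f : d.features()) {
                        s += e.getValue().getOrDefault(f, 0.0);
                    }
                    scores.put(e.getKey(), s);
                }
                return scores;
            }

            public String classOf(Datum d) {
                // Pick the highest-scoring class.
                String best = null;
                double bestScore = Double.NEGATIVE_INFINITY;
                for (Map.Entry<String, Double> e : scoresOf(d).entrySet()) {
                    if (e.getValue() > bestScore) {
                        best = e.getKey();
                        bestScore = e.getValue();
                    }
                }
                return best;
            }
        };
    }
}

class ClassifySketch {
    public static void main(String[] args) {
        List<Datum> train = Arrays.<Datum>asList(
            new BasicDatum(Arrays.asList("goal", "match"), "sports"),
            new BasicDatum(Arrays.asList("match", "team"), "sports"),
            new BasicDatum(Arrays.asList("stock", "market"), "finance"));

        // Factory -> Classifier -> classOf / scoresOf, as in the tutorial.
        Classifier c = new CountingClassifierFactory().trainClassifier(train);
        Datum query = new BasicDatum(Arrays.asList("match", "goal"), null);
        System.out.println(c.classOf(query));
        System.out.println(c.scoresOf(query));
    }
}
```

The point of the factory/classifier split is that training code and
classification code only meet through trainClassifier, so a new learner can
be swapped in without touching callers of classOf and scoresOf.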