JavaNLP Meeting Notes 10/01/03

Current tasks:

Trees/Parser
------------
Roger:
- Tree.java
  - children() returns empty array
- Chinese doesn't break on Windows
- currently working on general Tree utils

Jeanette:
- added ICE-GB reader
- working on improving accuracy

Teg:
- hasn't fully tested ASCII lexicon/grammar reading, but suspects that it works
- basically knows what's in Pubcrawl and will report next time
- Parser
  - implemented sum of inside probs so you can calculate P(sentence|grammar)

Chris:
- Tree.java
  - implemented getIndexOf(node), daughter(), get(daughter index), and equals() methods

Utils
-----
Dan:
- implemented MutableDouble
- implemented Interner based on WeakHashMap (will discuss in future mtg)
- implemented ContainmentMap
- will improve labeling of parser states
- will check in recent changes to parser (should be no dependencies on old stuff)

Counter
-------
Issues:
- keeping track of the total is difficult to impossible, and the current total is incorrect
- Heaps also have an Object -> double mapping, but Counter is too heavyweight for them
- need support for mapping n-dimensional objects to reals
- smoothing is built into Counter
- need support for setting the default value returned for objects not in the keyset
- put semantics are bad for MutableDouble/IdentityHashMap

Resolution:
- Map is probably not what we want
- Instead, use a lightweight weighted set implementation which Counter extends
- smoothing/normalization implemented as snapshot Views. Invoke on an instance, returns a new Counter
- calculate total dynamically

Action Items:
Dan: separate Counter from Weighted Set and clean up Counter implementation
Roger: implement GeneralizedWeightedSet

Tokenization
------------
- New (better) tokenizer framework implemented by Teg
  - implements Iterator interface
  - accepts Readers (instead of Lists/Streams)

Issues:
- pushback method is annoying to implement in subclasses
- pushback arbitrary objects?
- DefaultTokenizer doesn't seem to support pushback
- DummyTokenizer still has PTBTokenizer documentation
- SimpleTokenizer (and possibly PTBTokenizer) seems to slurp the entire file, defeating the purpose of a Reader

Resolution:
- we should use this new Tokenizer framework
- use peek instead of pushback
- if someone wants to buffer, they can implement a wrapper

Action Items:
Teg: Check Simple/PTBTokenizer and make sure they're not slurping entire file
Roger: Fix DummyTokenizer documentation
       Fix references to old io.StreamTokenizer
       Adapt JFlex/StreamTokenizer to new Tokenizer architecture

Clustering: deferred until next meeting
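
Sketch for the Counter resolution: a rough illustration of a lightweight weighted set that a counter extends, with the total calculated on demand and normalization returning a new snapshot counter. The class names SketchWeightedSet and SketchCounter, all method names, and the choice of 0.0 as the weight of unseen objects are assumptions made for illustration only; this is not the checked-in Counter or the planned GeneralizedWeightedSet.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical lightweight weighted set: object -> double weight, deliberately not a java.util.Map. */
class SketchWeightedSet<E> {
    private final Map<E, Double> weights = new HashMap<>();

    public void setWeight(E key, double weight) {
        weights.put(key, weight);
    }

    /** Assumption: objects that were never added have weight 0.0. */
    public double getWeight(E key) {
        Double w = weights.get(key);
        return w == null ? 0.0 : w;
    }

    public Set<E> keySet() {
        return weights.keySet();
    }

    /** The total is recomputed on each call instead of being cached, so it cannot go stale. */
    public double total() {
        double sum = 0.0;
        for (double w : weights.values()) {
            sum += w;
        }
        return sum;
    }
}

/** Hypothetical counter built on the weighted set; only the counting conveniences live here. */
class SketchCounter<E> extends SketchWeightedSet<E> {
    public void increment(E key) {
        incrementBy(key, 1.0);
    }

    public void incrementBy(E key, double delta) {
        setWeight(key, getWeight(key) + delta);
    }

    /** Snapshot view: returns a new counter, so later changes to this one do not affect it. */
    public SketchCounter<E> normalized() {
        double t = total();
        if (t == 0.0) {
            throw new IllegalStateException("cannot normalize: total weight is zero");
        }
        SketchCounter<E> snapshot = new SketchCounter<>();
        for (E key : keySet()) {
            snapshot.setWeight(key, getWeight(key) / t);
        }
        return snapshot;
    }
}
```

Because normalized() copies into a fresh counter, the original can keep accumulating counts without invalidating a distribution that was taken earlier. Smoothing could presumably follow the same pattern.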
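
Sketch for the Tokenization resolution: a rough illustration of a tokenizer that implements Iterator, reads lazily from a Reader (no slurping of the whole input), and offers peek() instead of pushback. The class name PeekableWhitespaceTokenizer, the whitespace-splitting rule, and the single-token lookahead are assumptions made for illustration only; this is not Teg's framework.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical whitespace tokenizer that pulls characters from a Reader one token at a time. */
class PeekableWhitespaceTokenizer implements Iterator<String> {
    private final Reader in;
    private String lookahead;  // the next token if already read, otherwise null
    private boolean done;      // true once the Reader is exhausted

    PeekableWhitespaceTokenizer(Reader in) {
        this.in = in;
    }

    /** Reads only as many characters as one token needs; returns null at end of input. */
    private String readToken() {
        try {
            StringBuilder sb = new StringBuilder();
            int c = in.read();
            while (c != -1 && Character.isWhitespace(c)) {   // skip leading whitespace
                c = in.read();
            }
            while (c != -1 && !Character.isWhitespace(c)) {  // collect token characters
                sb.append((char) c);
                c = in.read();
            }
            return sb.length() == 0 ? null : sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Ensures the one-token lookahead is filled if any input remains. */
    private void fill() {
        if (lookahead == null && !done) {
            lookahead = readToken();
            done = (lookahead == null);
        }
    }

    @Override
    public boolean hasNext() {
        fill();
        return lookahead != null;
    }

    /** Returns the next token without consuming it. */
    public String peek() {
        fill();
        if (lookahead == null) {
            throw new NoSuchElementException();
        }
        return lookahead;
    }

    @Override
    public String next() {
        String token = peek();
        lookahead = null;  // consume the token
        return token;
    }

    public static void main(String[] args) {
        PeekableWhitespaceTokenizer t =
            new PeekableWhitespaceTokenizer(new StringReader("the cat sat"));
        System.out.println("peek: " + t.peek());
        while (t.hasNext()) {
            System.out.println("next: " + t.next());
        }
    }
}
```

With peek(), a caller that reads one token too far never has to hand it back, so subclasses only need to implement token reading. A caller that wants deeper lookahead could wrap the iterator in its own buffer, as the resolution suggests.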