edu.stanford.nlp.trees.international.arabic
Class IBMArabicEscaper

java.lang.Object
  extended by edu.stanford.nlp.trees.international.arabic.IBMArabicEscaper
All Implemented Interfaces:
Function<List<HasWord>,List<HasWord>>, Serializable

public class IBMArabicEscaper
extends Object
implements Function<List<HasWord>,List<HasWord>>

This escaper deletes the '#' and '+' symbols that the IBM segmenter uses to mark prefixes and suffixes, since they are not present in the Penn Arabic Treebank materials (though they might be added later). It also escapes parenthesis characters.
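The behavior described above can be sketched on plain strings. This is an illustration only, not the class's actual source. The class name ArabicEscaperSketch and its methods are hypothetical, it works on String rather than HasWord so that it has no library dependency, and the replacement tokens -LRB- and -RRB- are an assumption based on the usual Penn Treebank convention, since this page does not say what parentheses are escaped to.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; NOT the Stanford implementation.
final class ArabicEscaperSketch {

    private ArabicEscaperSketch() {}

    // Assumption: parentheses are escaped to Penn Treebank-style tokens.
    static final String LEFT_PAREN = "-LRB-";
    static final String RIGHT_PAREN = "-RRB-";

    // Drops '#' and '+' and replaces '(' and ')' in a single token.
    // A token made up only of markers becomes the empty string here; how the
    // real class treats that case is not stated in the documentation.
    static String escapeWord(String word) {
        StringBuilder sb = new StringBuilder(word.length());
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            switch (c) {
                case '#':
                case '+':
                    break; // segmenter marker: delete
                case '(':
                    sb.append(LEFT_PAREN);
                    break;
                case ')':
                    sb.append(RIGHT_PAREN);
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns a new list and leaves the input list untouched.
    static List<String> escapeSentence(List<String> words) {
        List<String> out = new ArrayList<>(words.size());
        for (String w : words) {
            out.add(escapeWord(w));
        }
        return out;
    }
}
```

Unlike apply as documented on this page, the sketch builds a new list instead of overwriting the input items.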

Author:
Christopher Manning
See Also:
Serialized Form

Constructor Summary
IBMArabicEscaper()
           
 
Method Summary
 List<HasWord> apply(List<HasWord> arg)
          Note: At present this overwrites the items of the input list in place.
static void main(String[] args)
          This main method preprocesses one-sentence-per-line input, making changes similar to those made by apply.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

IBMArabicEscaper

public IBMArabicEscaper()
Method Detail

apply

public List<HasWord> apply(List<HasWord> arg)
Note: At present this overwrites the items of the input list in place rather than working on copies. This should be fixed.

Specified by:
apply in interface Function<List<HasWord>,List<HasWord>>

main

public static void main(String[] args)
                 throws IOException
This main method preprocesses one-sentence-per-line input, making changes similar to those made by apply.

Parameters:
args - A list of filenames. The files must be UTF-8 encoded.
Throws:
IOException - If an I/O error occurs while processing the files
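A minimal sketch of this kind of line-oriented preprocessing is shown below. It is not the actual main method. The class LinePreprocessorSketch and its methods are hypothetical, writing the result to standard output as UTF-8 is an assumption (this page does not say where the output goes), and the -LRB- and -RRB- replacement tokens are likewise assumed from the usual Penn Treebank convention.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch only; NOT the Stanford implementation.
final class LinePreprocessorSketch {

    private LinePreprocessorSketch() {}

    // Deletes '#' and '+' and escapes parentheses in one input line.
    // Assumption: "-LRB-" and "-RRB-" are the escaped forms.
    static String escapeLine(String line) {
        return line.replace("#", "")
                   .replace("+", "")
                   .replace("(", "-LRB-")
                   .replace(")", "-RRB-");
    }

    // Reads one sentence per line and appends each escaped line to out.
    static void process(BufferedReader in, Appendable out) throws IOException {
        String line;
        while ((line = in.readLine()) != null) {
            out.append(escapeLine(line)).append('\n');
        }
    }

    // Each argument is treated as the name of a UTF-8 encoded file.
    public static void main(String[] args) throws IOException {
        // Assumption: results go to standard output, encoded as UTF-8.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        for (String filename : args) {
            try (BufferedReader in =
                     Files.newBufferedReader(Paths.get(filename), StandardCharsets.UTF_8)) {
                process(in, out);
            }
        }
        out.flush();
    }
}
```

Opening each file with an explicit UTF-8 charset, rather than the platform default, matches the stated requirement that the input files be UTF-8 encoded.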


Stanford NLP Group