%%%%%% The dataset below is version 1 of the FTD dataset. For details and citations, please see 'Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers. Sonal Gupta, Christopher D. Manning. In the proceedings of IJCNLP 2011'. This data is a revised version of the data used in the paper. Please contact sonal@cs.stanford.edu if you have any questions or corrections. ##W97-0805 Lexical Discrimination With The Italian Version Of WordNet . We present a prototype of the Italian version of WORDNET , a general computational lexical resource . Some relevant extensions are discussed to make it usable for parsing : in particular we add verbal selectional restrictions to make lexical discrimination effective . Italian WORDNET has been coupled with a parser and a number of experiments have been performed to identify the methodology with the best trade-off between disambiguation rate and precision . Results confirm intuitive hypotheses on the role of selectional restrictions and show evidence for a WORDNET-like organization of lexical senses . ##W98-0301 A Surface-Based Approach To Identifying Discourse Markers And Elementary Textual Units In Unrestricted Texts . I present a surface-based algorithm that employs knowledge of cue phrase usages in order to determine automatically clause boundaries and discourse markers in unrestricted natural language texts . The knowledge was derived from a comprehensive corpus analysis . ##C04-1103 Direct Orthographical Mapping For Machine Transliteration . Machine transliteration\/back-transliteration plays an important role in many multilingual speech and language applications . In this paper , a novel framework for machine transliteration\/back-transliteration that allows us to carry out direct orthographical mapping -LRB- DOM -RRB- between two different languages is presented . 
Under this framework , a joint source-channel transliteration model , also called n-gram transliteration model -LRB- ngram TM -RRB- , is further proposed to model the transliteration process . We evaluate the proposed methods through several transliteration\/backtransliteration experiments for English\/Chinese and English\/Japanese language pairs . Our study reveals that the proposed method not only reduces an extensive system development effort but also improves the transliteration accuracy significantly . ##W02-1032 Exploiting Headword Dependency And Predictive Clustering For Language Modeling . This paper presents several practical ways of incorporating linguistic structure into language models . A headword detector is first applied to detect the headword of each phrase in a sentence . A permuted headword trigram model -LRB- PHTM -RRB- is then generated from the annotated corpus . Finally , PHTM is extended to a cluster PHTM -LRB- C-PHTM -RRB- by defining clusters for similar words in the corpus . We evaluated the proposed models on the realistic application of Japanese Kana-Kanji conversion . Experiments show that C-PHTM achieves 15 % error rate reduction over the word trigram model . This demonstrates that the use of simple methods such as the headword trigram and predictive clustering can effectively capture long distance word dependency , and substantially outperform a word trigram model . ##W01-1411 Towards A Simple And Accurate Statistical Approach To Learning Translation Relationships Among Words . We report on a project to derive word translation relationships automatically from parallel corpora . Our effort is distinguished by the use of simpler , faster models than those used in previous high-accuracy approaches . 
Our methods achieve accuracy on single-word translations that seems comparable to any work previously reported , up to nearly 60 % coverage of word types , and they perform particularly well on a class of multi-word compounds of special interest to our translation effort . ##N06-1053 Towards Spoken-Document Retrieval For The Internet : Lattice Indexing For Large-Scale Web-Search Architectures . Large-scale web-search engines are generally designed for linear text . The linear text representation is suboptimal for audio search , where accuracy can be significantly improved if the search includes alternate recognition candidates , commonly represented as word lattices . This paper proposes a method for indexing word lattices that is suitable for large-scale web-search engines , requiring only limited code changes . The proposed method , called Time-based Merging for Indexing -LRB- TMI -RRB- , first converts the word lattice to a posterior-probability representation and then merges word hypotheses with similar time boundaries to reduce the index size . Four alternative approximations are presented , which differ in index size and the strictness of the phrase-matching constraints . Results are presented for three types of typical web audio content , podcasts , video clips , and online lectures , for phrase spotting and relevance ranking . Using TMI indexes that are only five times larger than corresponding linear-text indexes , phrase spotting was improved over searching top-1 transcripts by 25-35 % , and relevance ranking by 14 % , at only a small loss compared to unindexed lattice search . ##D08-1055 A Japanese Predicate Argument Structure Analysis using Decision Lists . This paper describes a new automatic method for Japanese predicate argument structure analysis . 
The method learns relevant features to assign case roles to the argument of the target predicate using the features of the words located closest to the target predicate under various constraints such as dependency types , words , semantic categories , parts of speech , functional words and predicate voices . We constructed decision lists in which these features were sorted by their learned weights . Using our method , we integrated the tasks of semantic role labeling and zero-pronoun identification , and achieved a 17 % improvement compared with a baseline method in a sentence level performance analysis . ##W03-1002 Statistical Machine Translation Using Coercive Two-Level Syntactic Transduction . We define , implement and evaluate a novel model for statistical machine translation , which is based on shallow syntactic analysis -LRB- part-of-speech tagging and phrase chunking -RRB- in both the source and target languages . It is able to model long-distance constituent motion and other syntactic phenomena without requiring a full parse in either language . We also examine aspects of lexical transfer , suggesting and exploring a concept of translation coercion across parts of speech , as well as a transfer model based on lemma-to-lemma translation probabilities , which holds promise for improving machine translation of low-density languages . Experiments are performed in both Arabic-to-English and French-to-English translation demonstrating the efficacy of the proposed techniques . Performance is automatically evaluated via the Bleu score metric . ##N06-1034 Modelling User Satisfaction And Student Learning In A Spoken Dialogue Tutoring System With Generic , Tutoring , And User Affect Parameters . We investigate using the PARADISE framework to develop predictive models of system performance in our spoken dialogue tutoring system . We represent performance with two metrics : user satisfaction and student learning . 
We train and test predictive models of these metrics in our tutoring system corpora . We predict user satisfaction with 2 parameter types : 1 -RRB- system-generic , and 2 -RRB- tutoring-specific . To predict student learning , we also use a third type : 3 -RRB- user affect . Although generic parameters are useful predictors of user satisfaction in other PARADISE applications , overall our parameters produce less useful user satisfaction models in our system . However , generic and tutoring-specific parameters do produce useful models of student learning in our system . User affect parameters can increase the usefulness of these models . ##W98-1428 EXEMPLARS : A Practical , Extensible Framework For Dynamic Text Generation . In this paper , we present EXEMPLARS , an object-oriented , rule-based framework designed to support practical , dynamic text generation , emphasizing its novel features compared to existing hybrid systems that mix template-style and more sophisticated techniques . These features include an extensible classification-based text planning mechanism , a definition language that is a superset of the Java language , and advanced support for HTML\/SGML templates . ##I08-1049 Multi-View Co-Training of Transliteration Model . This paper discusses a new approach to training of transliteration model from unlabeled data for transliteration extraction . We start with an inquiry into the formulation of transliteration model by considering different transliteration strategies as a multi-view problem , where each view exploits a natural division of transliteration features , such as phoneme-based , grapheme-based or hybrid features . Then we introduce a multi-view Co-training algorithm , which leverages compatible and partially uncorrelated information across different views to effectively boost the model from unlabeled data . 
Applying this algorithm to transliteration extraction , the results show that it not only circumvents the need for data labeling , but also achieves performance close to that of supervised learning , where manual labeling is required for all training samples . ##W01-0904 Translating Treebank Annotation For Evaluation . In this paper we discuss the need for corpora with a variety of annotations to provide suitable resources to evaluate different Natural Language Processing systems and to compare them . A supervised machine learning technique is presented for translating corpora between syntactic formalisms and is applied to the task of translating the Penn Treebank annotation into a Categorial Grammar annotation . It is compared with a current alternative approach and results indicate annotation of broader coverage using a more compact grammar . ##J00-3004 A Compression-Based Algorithm For Chinese Word Segmentation . Chinese is written without using spaces or other word delimiters . Although a text may be thought of as a corresponding sequence of words , there is considerable ambiguity in the placement of boundaries . Interpreting a text as a sequence of words is beneficial for some information retrieval and storage tasks : for example , full-text search , word-based compression , and keyphrase extraction . We describe a scheme that infers appropriate positions for word boundaries using an adaptive language model that is standard in text compression . It is trained on a corpus of presegmented text , and when applied to new text , interpolates word boundaries so as to maximize the compression obtained . This simple and general method performs well with respect to specialized schemes for Chinese language segmentation . ##W09-2506 Ranking Paraphrases in Context . We present a vector space model that supports the computation of appropriate vector representations for words in context , and apply it to a paraphrase ranking task . 
An evaluation on the SemEval 2007 lexical substitution task data shows promising results : the model significantly outperforms a current state of the art model , and our treatment of context is effective . ##E87-1011 A Multi-Purpose Interface To An On-Line Dictionary . We argue that there are two qualitatively different modes of using a machine-readable dictionary in the context of research in computational linguistics : batch processing of the source with the purpose of collating information for subsequent use by a natural language application , and placing the dictionary on-line in an environment which supports fast interactive access to data selected on the basis of a number of linguistic constraints . While it is the former mode of dictionary use which is characteristic of most computational linguistics work to date , it is the latter which has the potential of making maximal use of the information typically found in a machine-readable dictionary . We describe the mounting of the machine-readable source of the Longman Dictionary of Contemporary English on a single user workstation to make it available as a development tool for a number of research projects . ##P00-1020 An Empirical Study Of The Influence Of Argument Conciseness On Argument Effectiveness . We have developed a system that generates evaluative arguments that are tailored to the user , properly arranged and concise . We have also developed an evaluation framework in which the effectiveness of evaluative arguments can be measured with real users . This paper presents the results of a formal experiment we have performed in our framework to verify the influence of argument conciseness on argument effectiveness . ##H05-1095 Translating With Non-Contiguous Phrases . This paper presents a phrase-based statistical machine translation method , based on non-contiguous phrases , i.e. phrases with gaps . A method for producing such phrases from a word-aligned corpus is proposed . 
A statistical translation model is also presented that deals with such phrases , as well as a training method based on the maximization of translation accuracy , as measured with the NIST evaluation metric . Translations are produced by means of a beam-search decoder . Experimental results are presented that demonstrate how the proposed method allows better generalization from the training data . ##P05-1039 What To Do When Lexicalization Fails : Parsing German With Suffix Analysis And Smoothing . In this paper , we present an unlexicalized parser for German which employs smoothing and suffix analysis to achieve a labeled bracket F-score of 76.2 , higher than previously reported results on the NEGRA corpus . In addition to the high accuracy of the model , the use of smoothing in an unlexicalized parser allows us to better examine the interplay between smoothing and parsing results . ##A97-1022 A Prototype Of A Grammar Checker For Czech . This paper describes the implementation of a prototype of a grammar-based grammar checker for Czech and the basic ideas behind this implementation . The demo is implemented as an independent program cooperating with Microsoft Word . The grammar checker uses a specialized grammar formalism which generally enables checking errors in languages with a very high degree of word order freedom . ##C02-2009 Machine Translation Based On NLG From XML-DB . The purpose of this study is to propose a new method for machine translation . We have proceeded with two projects for report generation -LRB- Kittredge and Polguere , 2000 -RRB- : Weather Forecast and Monthly Economic Report to be produced in four languages : English , Japanese , French , and German . Their input data is stored in XML-DB . We applied a three-stage pipelined architecture -LRB- Reiter and Dale , 2000 -RRB- , and each stage was implemented as XML transformation processes . 
We regard the XML-stored data as a language-neutral intermediate form and employ the so-called ` sublanguage approach ' -LRB- Somers , 2000 -RRB- . The machine translation process is implemented via XML-DB as a kind of interlingua approach instead of the conventional structure transfer approach . ##N09-2036 Faster MT Decoding Through Pervasive Laziness . Syntax-based MT systems have proven effective -- the models are compelling and show good room for improvement . However , decoding involves a slow search . We present a new lazy-search method that obtains significant speedups over a strong baseline , with no loss in Bleu . ##J07-1005 Answering Clinical Questions with Knowledge-Based and Statistical Techniques . The combination of recent developments in question-answering research and the availability of unparalleled resources developed specifically for automatic semantic processing of text in the medical domain provides a unique opportunity to explore complex question answering in the domain of clinical medicine . This article presents a system designed to satisfy the information needs of physicians practicing evidence-based medicine . We have developed a series of knowledge extractors , which employ a combination of knowledge-based and statistical techniques , for automatically identifying clinically relevant aspects of MEDLINE abstracts . These extracted elements serve as the input to an algorithm that scores the relevance of citations with respect to structured representations of information needs , in accordance with the principles of evidence-based medicine . Starting with an initial list of citations retrieved by PubMed , our system can bring relevant abstracts into higher ranking positions , and from these abstracts generate responses that directly answer physicians ' questions . 
We describe three separate evaluations : one focused on the accuracy of the knowledge extractors , one conceptualized as a document reranking task , and finally , an evaluation of answers by two physicians . Experiments on a collection of real-world clinical questions show that our approach significantly outperforms the already competitive PubMed baseline . ##C08-1046 Japanese Dependency Parsing Using a Tournament Model . In Japanese dependency parsing , Kudo 's relative preference-based method -LRB- Kudo and Matsumoto , 2005 -RRB- outperforms both deterministic and probabilistic CKY-based parsing methods . In Kudo 's method , for each dependent word -LRB- or chunk -RRB- a log-linear model estimates the relative preference of all other candidate words -LRB- or chunks -RRB- for being its head . This cannot be considered in the deterministic parsing methods . We propose an algorithm based on a tournament model , in which the relative preferences are directly modeled by one-on-one games in a step-ladder tournament . In an evaluation experiment with Kyoto Text Corpus Version 4.0 , the proposed method outperforms previous approaches , including the relative preference-based method . ##W98-0303 Enriching Automated Essay Scoring Using Discourse Marking . Electronic Essay Rater -LRB- e-rater -RRB- is a prototype automated essay scoring system built at Educational Testing Service -LRB- ETS -RRB- that uses discourse marking , in addition to syntactic information and topical content vector analyses , to automatically assign essay scores . This paper gives a general description of e-rater as a whole , but its emphasis is on the importance of discourse marking and argument partitioning for annotating the argument structure of an essay . We show comparisons between two content vector analysis programs used to predict scores : EssayContent and ArgContent . 
EssayContent assigns scores to essays by using a standard cosine correlation that treats the essay like a `` bag of words '' , in that it does not consider word order . ArgContent employs a novel content vector analysis approach for score assignment based on the individual arguments in an essay . The average agreement between ArgContent scores and human rater scores is 82 % , as compared to 69 % agreement between EssayContent and the human raters . These results suggest that discourse marking enriches e-rater 's scoring capability . When e-rater uses its whole set of predictive features , agreement with human rater scores ranges from 87 % to 94 % across the 15 sets of essay responses used in this study . ##W04-2213 Building Parallel Corpora For EContent Professionals . This paper reports on completed work carried out in the framework of the INTERA project , and specifically , on the production of multilingual resources -LRB- LRs -RRB- for eContent purposes . The paper presents the methodology adopted for the development of the corpus -LRB- acquisition and processing of the textual data -RRB- , discusses the divergence of the initial assumptions from the actual situation met during this procedure , and concludes with a summarization of the problems attested which undermine the viability of multilingual parallel corpora construction . ##W09-2603 Mining of Parsed Data to Derive Deverbal Argument Structure . The availability of large parsed corpora and improved computing resources now make it possible to extract vast amounts of lexical data . We describe the process of extracting structured data and several methods of deriving argument structure mappings for deverbal nouns that significantly improve upon non-lexicalized rule-based methods . For a typical model , the F-measure of performance improves from a baseline of about 0.72 to 0.81 . ##C04-1069 Document Re-Ranking Based On Automatically Acquired Key Terms In Chinese Information Retrieval . 
For Information Retrieval , users are more concerned about the precision of top ranking documents in most practical situations . In this paper , we propose a method to improve the precision of the top N ranking documents by reordering the retrieved documents from the initial retrieval . To reorder documents , we first automatically extract Global Key Terms from the document set , then use the extracted Global Key Terms to identify Local Key Terms in a single document or query topic , and finally we make use of Local Key Terms in the query and documents to reorder the initial ranking documents . The experiment with the NTCIR3 CLIR dataset shows that an average 10 % -11 % improvement and 2 % -5 % improvement in precision can be achieved at the top 10 and 100 ranking documents level respectively . ##H92-1046 Lexical Disambiguation Using Simulated Annealing . The resolution of lexical ambiguity is important for most natural language processing tasks , and a range of computational techniques have been proposed for its solution . None of these has yet proven effective on a large scale . In this paper , we describe a method for lexical disambiguation of text using the definitions in a machine-readable dictionary together with the technique of simulated annealing . The method operates on complete sentences and attempts to select the optimal combinations of word senses for all the words in the sentence simultaneously . The words in the sentences may be any of the 28,000 headwords in Longman 's Dictionary of Contemporary English -LRB- LDOCE -RRB- and are disambiguated relative to the senses given in LDOCE . Our initial results on a sample set of 50 sentences are comparable to those of other researchers , and the fully automatic method requires no hand coding of lexical entries or hand tagging of text . ##D08-1082 A Generative Model for Parsing Natural Language to Meaning Representations . 
In this paper , we present an algorithm for learning a generative model of natural language sentences together with their formal meaning representations with hierarchical structures . The model is applied to the task of mapping sentences to hierarchical representations of their underlying meaning . We introduce dynamic programming techniques for efficient training and decoding . In experiments , we demonstrate that the model , when coupled with a discriminative reranking technique , achieves state-of-the-art performance when tested on two publicly available corpora . The generative model degrades robustly when presented with instances that are different from those seen in training . This allows a notable improvement in recall compared to previous models . ##C90-2064 The Application Of Two-Level Morphology To Non-Concatenative German Morphology . In this paper we describe a hybrid system for morphological analysis and synthesis . We call it hybrid because it consists of two separate parts interacting with each other in a well-defined way . The treatment of morphonology and non-concatenative morphology is based on the two-level approach originally proposed by Koskenniemi -LRB- 1983 -RRB- . For the concatenative part of morphosyntax -LRB- i.e. affixation -RRB- we make use of a grammar based on feature-unification . Both parts rely on the same morph lexicon . Combinations of two-level morphology with feature-based morphosyntactic grammars have already been proposed by several authors -LRB- cf. Bear 1988a , Carson 1988 , Görz & Paulus 1988 , Schiller & Steffens 1990 -RRB- to overcome the shortcomings of the continuation-classes originally proposed by Koskenniemi -LRB- 1983 -RRB- and Karttunen -LRB- 1983 -RRB- for the description of morphosyntax . But up to now no linguistically satisfying solution has been proposed for the treatment of non-concatenative morphology in such a framework . 
In this paper we describe an extension to the model which will allow for the description of such phenomena . Namely , we propose to restrict the applicability of two-level rules by providing them with filters in the form of feature structures . We demonstrate how a well-known problem of German morphology , so-called `` Umlautung '' , can be described in our approach in a linguistically motivated and efficient way . ##W98-0904 Optimal Morphology . Optimal morphology -LRB- OM -RRB- is a finite state formalism that unifies concepts from Optimality Theory -LRB- OT , Prince & Smolensky , 1993 -RRB- and Declarative Phonology -LRB- DP , Scobbie , Coleman & Bird , 1996 -RRB- to describe morphophonological alternations in inflectional morphology . Candidate sets are formalized by inviolable lexical constraints which map abstract morpheme signatures to allomorphs . Phonology is implemented as violable rankable constraints selecting optimal candidates from these . Both types of constraints are realized by finite state transducers . Using phonological data from Albanian it is shown that given a finite state lexicalization of candidate outputs for word forms OM allows more natural analyses than inviolable finite state constraints do . Two possible evaluation strategies for OM grammars are considered : the global evaluation procedure from Ellison -LRB- 1994 -RRB- and a simple strategy of local constraint evaluation . While the OM-specific lexicalization of candidate sets allows straightforward generation and a simple method of morphological parsing even under global evaluation , local constraint evaluation is shown to be preferable empirically and to be formally more restrictive . The first point is illustrated by an account of directionality effects in some classical Mende data . A procedure is given that generates a finite state transducer simulating the effects of local constraint evaluation . 
Thus local as opposed to global evaluation -LRB- Frank & Satta , 1998 -RRB- seems to guarantee the finite-stateness of the input-output mapping . ##W07-0723 Getting to Know Moses : Initial Experiments on German-English Factored Translation . We present results and experiences from our experiments with phrase-based statistical machine translation using Moses . The paper is based on the idea of using an off-the-shelf parser to supply linguistic information to a factored translation model and compare the results of German-English translation to the shared task baseline system based on word form . We report partial results for this model and results for two simplified setups . Our best setup takes advantage of the parser 's lemmatization and decompounding . A qualitative analysis of compound translation shows that decompounding improves translation quality . ##W07-2044 KU : Word Sense Disambiguation by Substitution . Data sparsity is one of the main factors that make word sense disambiguation -LRB- WSD -RRB- difficult . To overcome this problem we need to find effective ways to use resources other than sense labeled data . In this paper I describe a WSD system that uses a statistical language model based on a large unannotated corpus . The model is used to evaluate the likelihood of various substitutes for a word in a given context . These likelihoods are then used to determine the best sense for the word in novel contexts . The resulting system participated in three tasks in the SemEval 2007 workshop . The WSD of prepositions task proved to be challenging for the system , possibly illustrating some of its limitations : e.g. not all words have good substitutes . The system achieved promising results for the English lexical sample and English lexical substitution tasks . ##W04-2003 A Robust And Hybrid Deep-Linguistic Theory Applied To Large-Scale Parsing . 
Modern statistical parsers are robust and quite fast , but their output is relatively shallow when compared to formal grammar parsers . We suggest extending statistical approaches to a more deep-linguistic analysis while at the same time keeping the speed and low complexity of a statistical parser . The resulting parsing architecture suggested , implemented and evaluated here is highly robust and hybrid on a number of levels , combining statistical and rule-based approaches , constituency and dependency grammar , shallow and deep processing , full and near-full parsing . With its parsing speed of about 300,000 words per hour and state-of-the-art performance the parser is reliable for a number of large-scale applications discussed in the article . ##W04-3249 Unsupervised Domain Relevance Estimation For Word Sense Disambiguation . This paper presents Domain Relevance Estimation -LRB- DRE -RRB- , a fully unsupervised text categorization technique based on the statistical estimation of the relevance of a text with respect to a certain category . We use a pre-defined set of categories -LRB- we call them domains -RRB- which have been previously associated to WORDNET word senses . Given a certain domain , DRE distinguishes between relevant and non-relevant texts by means of a Gaussian Mixture model that describes the frequency distribution of domain words inside a large-scale corpus . Then , an Expectation Maximization algorithm computes the parameters that maximize the likelihood of the model on the empirical data . The correct identification of the domain of the text is a crucial point for Domain Driven Disambiguation , an unsupervised Word Sense Disambiguation -LRB- WSD -RRB- methodology that makes use of only domain information . Therefore , DRE has been exploited and evaluated in the context of a WSD task . Results are comparable to those of state-of-the-art unsupervised WSD systems and show that DRE provides an important contribution . 
##W06-1631 Capturing Out-Of-Vocabulary Words In Arabic Text . The increasing flow of information between languages has led to a rise in the frequency of non-native or loan words , where terms of one language appear transliterated in another . Dealing with such out-of-vocabulary words is essential for successful cross-lingual information retrieval . For example , techniques such as stemming should not be applied indiscriminately to all words in a collection , and so before any stemming , foreign words need to be identified . In this paper , we investigate three approaches for the identification of foreign words in Arabic text : lexicons , language patterns , and n-grams , and present results showing that lexicon-based approaches outperform the other techniques . ##W98-1427 Generation As A Solution To Its Own Problem . Natural language generation technology is now ripe for commercial exploitation , but one of the remaining bottlenecks is that of providing NLG systems with user-friendly interfaces for specifying the content of documents to be generated . We present here a new technique we have developed for providing such interfaces : WYSIWYM editing . WYSIWYM -LRB- What You See Is What You Meant -RRB- makes novel use of the system 's generator to provide a natural language input device which requires no NL interpretation -- only NL generation . ##P97-1005 Automatic Detection Of Text Genre . As the text databases available to users become larger and more heterogeneous , genre becomes increasingly important for computational linguistics as a complement to topical and structural principles of classification . We propose a theory of genres as bundles of facets , which correlate with various surface cues , and argue that genre detection based on surface cues is as successful as detection based on deeper structural properties . ##N09-1063 Hierarchical Search for Parsing . Both coarse-to-fine and A ∗ parsing use simple grammars to guide search in complex ones . 
We compare the two approaches in a common , agenda-based framework , demonstrating the tradeoffs and relative strengths of each method . Overall , coarse-to-fine is much faster for moderate levels of search errors , but below a certain threshold A ∗ is superior . In addition , we present the first experiments on hierarchical A ∗ parsing , in which computation of heuristics is itself guided by meta-heuristics . Multi-level hierarchies are helpful in both approaches , but are more effective in the coarse-to-fine case because of accumulated slack in A ∗ heuristics . ##C00-1075 Application Of Analogical Modelling To Example Based Machine Translation . This paper describes a self-modelling , incremental algorithm for learning translation rules from existing bilingual corpora . The notions of supracontext and subcontext are extended to encompass bilingual information through simultaneous analogy on both source and target sentences and juxtaposition of corresponding results . Analogical modelling is performed during the learning phase and translation patterns are projected in a multi-dimensional analogical network . The proposed framework was evaluated on a small training corpus providing promising results . Suggestions to improve system performance are also given . ##H01-1045 Large Scale Testing Of A Descriptive Phrase Finder . This paper describes an evaluation of an existing technique that locates sentences containing descriptions of a query word or phrase . The experiments expand on previous tests by exploring the effectiveness of the system when searching from a much larger document collection . The results showed the system working significantly better than when searching over smaller collections . The improvement was such that a more stringent definition of what constituted a correct description was devised to better measure effectiveness . The results also pointed to potentially new forms of evidence that might be used in improving the location process . 
Keywords Information retrieval , descriptive phrases , WWW . ##D09-1071 The infinite HMM for unsupervised PoS tagging . We extend previous work on fully unsupervised part-of-speech tagging . Using a non-parametric version of the HMM , called the infinite HMM -LRB- iHMM -RRB- , we address the problem of choosing the number of hidden states in unsupervised Markov models for PoS tagging . We experiment with two non-parametric priors , the Dirichlet and Pitman-Yor processes , on the Wall Street Journal dataset using a parallelized implementation of an iHMM inference algorithm . We evaluate the results with a variety of clustering evaluation metrics and achieve equivalent or better performance than previously reported . Building on this promising result , we evaluate the output of the unsupervised PoS tagger as a direct replacement for the output of a fully supervised PoS tagger for the task of shallow parsing and compare the two evaluations . ##P05-2018 Centrality Measures In Text Mining : Prediction Of Noun Phrases That Appear In Abstracts . In this paper , we study different centrality measures being used in predicting noun phrases appearing in the abstracts of scientific articles . Our experimental results show that centrality measures improve the accuracy of the prediction in terms of both precision and recall . We also found that the method of constructing the Noun Phrase Network significantly influences the accuracy when the centrality heuristics are used alone , but is negligible when they are used together with other text features in decision trees . ##D08-1096 A Graph-theoretic Model of Lexical Syntactic Acquisition . This paper presents a graph-theoretic model of the acquisition of lexical syntactic representations . The representations the model learns are non-categorical or graded . We propose a new evaluation methodology of syntactic acquisition in the framework of exemplar theory . 
When applied to the CHILDES corpus , the evaluation shows that the model 's graded syntactic representations perform better than previously proposed categorical representations . ##W97-1311 Event Coreference For Information Extraction . We propose a general approach for performing event coreference and for constructing complex event representations , such as those required for information extraction tasks . Our approach is based on a representation which allows a tight coupling between world or conceptual modelling and discourse modelling . The representation and the coreference mechanism are fully implemented within the LaSIE information extraction system where the mechanism is used for both object -LRB- noun phrase -RRB- and event coreference resolution . Indirect evaluation of the approach shows a small but significant benefit for information extraction tasks . ##P05-3018 Word Alignment And Cross-Lingual Resource Acquisition . Annotated corpora are valuable resources for developing Natural Language Processing applications . This work focuses on acquiring annotated data for multilingual processing applications . We present an annotation environment that supports a web-based user-interface for acquiring word alignments between English and Chinese as well as a visualization tool for researchers to explore the annotated data . ##A00-1046 The Efficiency Of Multimodal Interaction For A Map-Based Task . This paper compares the efficiency of using a standard direct-manipulation graphical user interface -LRB- GUI -RRB- with that of using the QuickSet pen\/voice multimodal interface for supporting a military task . In this task , a user places military units and control measures -LRB- e.g. , various types of lines , obstacles , objectives -RRB- on a map . Four military personnel designed and entered their own simulation scenarios via both interfaces . 
Analyses revealed that the multimodal interface led to an average 3.5-fold speed improvement in the average entity creation time , including all error handling . The mean time to repair errors was also 4.3 times faster when interacting multimodally . Finally , all subjects reported a strong preference for multimodal interaction . These results indicate a substantial efficiency advantage for multimodal over GUI-based interaction during map-based tasks . ##W98-0319 Lexical , Prosodic , And Syntactic Cues For Dialog Acts . The structure of a discourse is reflected in many aspects of its linguistic realization , including its lexical , prosodic , syntactic , and semantic nature . Multiparty dialog contains a particular kind of discourse structure , the dialog act -LRB- DA -RRB- . Like other types of structure , the dialog act sequence of a conversation is also reflected in its lexical , prosodic , and syntactic realization . This paper presents a preliminary investigation into the realization of a particular class of dialog acts which play an essential structuring role in dialog , the backchannels or acknowledgement tokens . We discuss the lexical , prosodic , and syntactic realization of these and subsumed or related dialog acts like continuers , assessments , yes-answers , agreements , and incipient-speakership . We show that lexical knowledge plays a role in distinguishing these dialog acts , despite the widespread ambiguity of words such as yeah , and that prosodic knowledge plays a role in DA identification for certain DA types , while lexical cues may be sufficient for the remainder . Finally , our investigation of the syntax of assessments suggests that at least some dialog acts have a very constrained syntactic realization , a per-dialog-act ` microsyntax ' . ##P03-1044 Counter-Training In Discovery Of Semantic Patterns . This paper presents a method for unsupervised discovery of semantic patterns . 
Semantic patterns are useful for a variety of text understanding tasks , in particular for locating events in text for information extraction . The method builds upon previously described approaches to iterative unsupervised pattern acquisition . One common characteristic of prior approaches is that the output of the algorithm is a continuous stream of patterns , with gradually degrading precision . Our method differs from the previous pattern acquisition algorithms in that it introduces competition among several scenarios simultaneously . This provides natural stopping criteria for the unsupervised learners , while maintaining good precision levels at termination . We discuss the results of experiments with several scenarios , and examine different aspects of the new procedure . ##N06-3006 Detecting Emotion In Speech : Experiments In Three Domains . The goal of my proposed dissertation work is to help answer two fundamental questions : -LRB- 1 -RRB- How is emotion communicated in speech ? and -LRB- 2 -RRB- Does emotion modeling improve spoken dialogue applications ? In this paper I describe feature extraction and emotion classification experiments I have conducted and plan to conduct on three different domains : EPSaT , HMIHY , and ITSpoke . In addition , I plan to implement emotion modeling capabilities into ITSpoke and evaluate the effectiveness of doing so . ##E09-1056 Improvements in Analogical Learning : Application to Translating Multi-Terms of the Medical Domain . Handling terminology is an important matter in a translation workflow . However , current Machine Translation -LRB- MT -RRB- systems do not yet offer proactive tools which assist in managing terminological databases . In this work , we investigate several enhancements to analogical learning and test our implementation on translating medical terms . 
We show that the analogical engine works equally well when translating from and into a morphologically rich language , or when dealing with language pairs written in different scripts . Combining it with a phrase-based statistical engine leads to significant improvements . ##W06-1112 A Structural Similarity Measure . This paper outlines a measure of language similarity based on structural similarity of surface syntactic dependency trees . Unlike the more traditional string-based measures , this measure tries to reflect `` deeper '' correspondences among languages . The development of this measure has been inspired by the experience from MT of syntactically similar languages . This experience shows that the lexical similarity is less important than syntactic similarity . This claim is supported by a number of examples illustrating the problems which may arise when a measure of language similarity relies too much on a simple similarity of texts in different languages . ##C04-1159 Dependency Structure Analysis And Sentence Boundary Detection In Spontaneous Japanese . This paper describes a project to detect dependencies between Japanese phrasal units called bunsetsus , and sentence boundaries in a spontaneous speech corpus . In monologues , the biggest problem with dependency structure analysis is that sentence boundaries are ambiguous . In this paper , we propose two methods for improving the accuracy of sentence boundary detection in spontaneous Japanese speech : One is based on statistical machine translation using dependency information and the other is based on text chunking using SVM . An F-measure of 84.9 was achieved for the accuracy of sentence boundary detection by using the proposed methods . The accuracy of dependency structure analysis was also improved from 75.2 % to 77.2 % by using automatically detected sentence boundaries . 
The accuracy of dependency structure analysis and that of sentence boundary detection were also improved by interactively using both automatically detected dependency structures and sentence boundaries . ##P07-1085 Unsupervised Language Model Adaptation Incorporating Named Entity Information . Language model -LRB- LM -RRB- adaptation is important for both speech and language processing . It is often achieved by combining a generic LM with a topic-specific model that is more relevant to the target document . Unlike previous work on unsupervised LM adaptation , this paper investigates how effectively using named entity -LRB- NE -RRB- information , instead of considering all the words , helps LM adaptation . We evaluate two latent topic analysis approaches in this paper , namely , clustering and Latent Dirichlet Allocation -LRB- LDA -RRB- . In addition , a new dynamically adapted weighting scheme for topic mixture models is proposed based on LDA topic analysis . Our experimental results show that the NE-driven LM adaptation framework outperforms the baseline generic LM . The best result is obtained using the LDA-based approach by expanding the named entities with syntactically filtered words , together with using a large number of topics , which yields a perplexity reduction of 14.23 % compared to the baseline generic LM . ##W99-0908 Text Classification By Bootstrapping With Keywords , EM And Shrinkage . When applying text classification to complex tasks , it is tedious and expensive to hand-label the large amounts of training data necessary for good performance . This paper presents an alternative approach to text classification that requires no labeled documents ; instead , it uses a small set of keywords per class , a class hierarchy and a large quantity of easily obtained unlabeled documents . The keywords are used to assign approximate labels to the unlabeled documents by term matching . 
These preliminary labels become the starting point for a bootstrapping process that learns a naive Bayes classifier using Expectation-Maximization and hierarchical shrinkage . When classifying a complex data set of computer science research papers into a 70-leaf topic hierarchy , the keywords alone provide 45 % accuracy . The classifier learned by bootstrapping reaches 66 % accuracy , a level close to human agreement . ##N03-2028 LM Studies On Filled Pauses In Spontaneous Medical Dictation . We investigate the optimal LM treatment of abundant filled pauses -LRB- FP -RRB- in spontaneous monologues of a professional dictation task . Questions addressed here are -LRB- 1 -RRB- how to deal with FP in the LM history and -LRB- 2 -RRB- to what extent the LM can distinguish between positions with high and low FP likelihood . Our results differ partly from observations reported on dialogues . Discarding FP from all LM histories clearly improves the performance . Local perplexities , entropies and word rankings at positions following FP suggest that most FP indicate hesitations rather than restarts . Proper prediction of FP allows us to distinguish FP from word positions by a doubled FP probability . Recognition experiments confirm the improvements found in our perplexity studies . ##P06-1092 Phoneme-To-Text Transcription System With An Infinite Vocabulary . The noisy channel model approach is successfully applied to various natural language processing tasks . Currently the main research focus of this approach is adaptation methods , how to capture characteristics of words and expressions in a target domain given example sentences in that domain . As a solution we describe a method enlarging the vocabulary of a language model to an almost infinite size and capturing their context information . The new method is especially suitable for languages in which words are not delimited by whitespace . 
We applied our method to a phoneme-to-text transcription task in Japanese and reduced about 10 % of the errors in the results of an existing method . ##C08-1131 Measuring and Predicting Orthographic Associations : Modelling the Similarity of Japanese Kanji . As human beings , our mental processes for recognizing linguistic symbols generate perceptual neighborhoods around such symbols where confusion errors occur . Such neighborhoods also provide us with conscious mental associations between symbols . This paper formalises orthographic models for similarity of Japanese kanji , and provides a proof-of-concept dictionary extension leveraging the mental associations provided by orthographic proximity . ##P06-1044 Automatic Classification Of Verbs In Biomedical Texts . Lexical classes , when tailored to the application and domain in question , can provide an effective means to deal with a number of natural language processing -LRB- NLP -RRB- tasks . While manual construction of such classes is difficult , recent research shows that it is possible to automatically induce verb classes from cross-domain corpora with promising accuracy . We report a novel experiment where similar technology is applied to the important , challenging domain of biomedicine . We show that the resulting classification , acquired from a corpus of biomedical journal articles , is highly accurate and strongly domain-specific . It can be used to aid BIO-NLP directly or as useful material for investigating the syntax and semantics of verbs in biomedical texts . ##P04-1026 Linguistic Profiling For Authorship Recognition And Verification . A new technique is introduced , linguistic profiling , in which large numbers of counts of linguistic features are used as a text profile , which can then be compared to average profiles for groups of texts . The technique proves to be quite effective for authorship verification and recognition . 
The best parameter settings yield a False Accept Rate of 8.1 % at a False Reject Rate equal to zero for the verification task on a test corpus of student essays , and a 99.4 % 2-way recognition accuracy on the same corpus . ##I05-1054 Machine Translation Based on Constraint-Based Synchronous Grammar . This paper proposes a variation of synchronous grammar based on the formalism of context-free grammar by generalizing the first component of productions that models the source text , named Constraint-based Synchronous Grammar -LRB- CSG -RRB- . Unlike other synchronous grammars , CSG allows multiple target productions to be associated with a single source production rule , which can be used to guide a parser to infer different possible translational equivalences for a recognized input string according to the feature constraints of symbols in the pattern . Furthermore , CSG is augmented with independent rewriting that allows expressing discontinuous constituents in the inference rules . It turns out that such a grammar is more expressive in modeling the translational equivalences of parallel texts for machine translation , and in this paper , we propose the use of CSG as a basis for building a machine translation -LRB- MT -RRB- system for Portuguese to Chinese translation . ##P05-1032 Scaling Phrase-Based Statistical Machine Translation To Larger Corpora And Longer Phrases . In this paper we describe a novel data structure for phrase-based statistical machine translation which allows for the retrieval of arbitrarily long phrases while simultaneously using less memory than is required by current decoder implementations . We detail the computational complexity and average retrieval times for looking up phrase translations in our suffix array-based data structure . We show how sampling can be used to reduce the retrieval time by orders of magnitude with no loss in translation quality . ##J91-4002 Systemic Classification And Its Efficiency . 
This paper examines the problem of classifying linguistic objects on the basis of information encoded in the system network formalism developed by Halliday . It is shown that this problem is NP-hard , and a restriction to the formalism , which renders the classification problem soluble in polynomial time , is suggested . An algorithm for the unrestricted classification problem , which separates a potentially expensive second stage from a more tractable first stage , is then presented . ##I05-1088 Extracting Terminologically Relevant Collocations in the Translation of Chinese Monograph . This paper suggests a methodology which is aimed to extract the terminologically relevant collocations for translation purposes . Our basic idea is to use a hybrid method which combines the statistical method and linguistic rules . The extraction system used in our work operated at three steps : -LRB- 1 -RRB- Tokenization and POS tagging of the corpus ; -LRB- 2 -RRB- Extraction of multi-word units using statistical measure ; -LRB- 3 -RRB- Linguistic filtering to make use of syntactic patterns and stop-word list . As a result , the hybrid method using linguistic filters proved to be suitable for selecting terminological collocations ; it considerably improved the precision of the extraction , which is much higher than that of the purely statistical method . In our test , the hybrid method combining `` Log-likelihood ratio '' and `` linguistic rules '' had the best performance in the extraction . We believe that terminological collocations and phrases extracted in this way could be used effectively either to supplement existing terminological collections or in addition to traditional reference works . ##W97-0906 Practical Considerations In Building A Multi-Lingual Authoring System For Business Letters . The paper describes the experiences of a multi-national consortium in an on-going project to construct a multilingual authoring tool for business letters . 
The consortium consists of two universities -LRB- both with significant experience in language engineering -RRB- , three software companies , and various potential commercial users with the organizations being located in a total of four countries . The paper covers the history of the development of the project from an academic idea but focuses on the implications of the user-requirements orientated outlook of the commercial developers and the implications of this view for the system architecture , user requirements , delivery platforms and so on . Particularly interesting consequences of the user requirements are the database centred architecture , and the constraints and opportunities this presents for development of grammatical components at both the text and sentence level . ##W97-1006 Method For Improving Automatic Word Categorization . This paper presents a new approach to automatic word categorization which improves both the efficiency of the algorithm and the quality of the formed clusters . The unigram and the bigram statistics of a corpus of about two million words are used with an efficient distance function to measure the similarities of words , and a greedy algorithm to put the words into clusters . The notions of fuzzy clustering , such as cluster prototypes and degree of membership , are used to form the clusters . The algorithm is unsupervised and the number of clusters is determined at run-time . ##P05-3008 A Voice Enabled Procedure Browser For The International Space Station . Clarissa , an experimental voice enabled procedure browser that has recently been deployed on the International Space Station -LRB- ISS -RRB- , is to the best of our knowledge the first spoken dialog system in space . 
This paper gives background on the system and the ISS procedures , then discusses the research developed to address three key problems : grammar-based speech recognition using the Regulus toolkit ; SVM based methods for open microphone speech recognition ; and robust side-effect free dialogue management for handling undos , corrections and confirmations . ##W07-1026 Automatic Indexing of Specialized Documents : Using Generic vs. Domain-Specific Document Representations . The shift from paper to electronic documents has caused the curation of information sources in large electronic databases to become more generalized . In the biomedical domain , continuing efforts aim at refining indexing tools to assist with the update and maintenance of databases such as MEDLINE ® . In this paper , we evaluate two statistical methods of producing MeSH ® indexing recommendations for the genetics literature , including recommendations involving subheadings , which is a novel application for the methods . We show that a generic representation of the documents yields both better precision and recall . We also find that a domain-specific representation of the documents can contribute to enhancing recall . ##C08-1042 Evaluating Unsupervised Part-of-Speech Tagging for Grammar Induction . This paper explores the relationship between various measures of unsupervised part-of-speech tag induction and the performance of both supervised and unsupervised parsing models trained on induced tags . We find that no standard tagging metrics correlate well with unsupervised parsing performance , and several metrics grounded in information theory have no strong relationship with even supervised parsing performance . ##C96-2104 A Portable And Quick Japanese Parser : QJP . QJP is a portable and quick software module for Japanese processing . 
QJP analyzes a Japanese sentence into segmented morphemes\/words with tags and a syntactic bunsetsu kakari-uke structure based on the two strategies , a -RRB- Morphological analysis based on character-types and functional-words and b -RRB- Syntactic analysis by simple treatment of structural ambiguities and ignoring semantic information . QJP is small , fast and robust , because 1 -RRB- dictionary size -LRB- less than 100KB -RRB- and required memory size -LRB- 260KB -RRB- are very small , 2 -RRB- analysis speed is fast -LRB- more than 100 words\/sec on 80486-PC -RRB- , and 3 -RRB- even a 100-word long sentence containing unknown words is easily processed . Using QJP and its analysis results as a base and adding other functions for processing Japanese documents , a variety of applications can be developed on UNIX workstations or even on PCs . ##J94-4001 A Syntactic Analysis Method Of Long Japanese Sentences Based On The Detection Of Conjunctive Structures . This paper presents a syntactic analysis method that first detects conjunctive structures in a sentence by checking parallelism of two series of words and then analyzes the dependency structure of the sentence with the help of the information about the conjunctive structures . Analysis of long sentences is one of the most difficult problems in natural language processing . The main reason for this difficulty is the structural ambiguity that is common for conjunctive structures that appear in long sentences . Human beings can recognize conjunctive structures because of a certain , but sometimes subtle , similarity that exists between conjuncts . Therefore , we have developed an algorithm for calculating a similarity measure between two arbitrary series of words from the left and the right of a conjunction and selecting the two most similar series of words that can reasonably be considered as composing a conjunctive structure . This is realized using a dynamic programming technique . 
A long sentence can be reduced into a shorter form by recognizing conjunctive structures . Consequently , the total dependency structure of a sentence can be obtained by relatively simple head-dependent rules . A serious problem concerning conjunctive structures , besides the ambiguity of their scopes , is the ellipsis of some of their components . Through our dependency analysis process , we can find the ellipses and recover the omitted components . We report the results of analyzing 150 Japanese sentences to illustrate the effectiveness of this method . ##W97-0803 Extending A Thesaurus By Classifying Words . This paper proposes a method for extending an existing thesaurus through classification of new words in terms of that thesaurus . New words are classified on the basis of relative probabilities of a word belonging to a given word class , with the probabilities calculated using noun-verb co-occurrence pairs . Experiments using the Japanese Bunruigoihyo thesaurus on about 420,000 co-occurrences showed that new words can be classified correctly with a maximum accuracy of more than 80 % . ##H05-1059 Bidirectional Inference With The Easiest-First Strategy For Tagging Sequence Data . This paper presents a bidirectional inference algorithm for sequence labeling problems such as part-of-speech tagging , named entity recognition and text chunking . The algorithm can enumerate all possible decomposition structures and find the highest probability sequence together with the corresponding decomposition structure in polynomial time . We also present an efficient decoding algorithm based on the easiest-first strategy , which gives comparably good performance to full bidirectional inference with significantly lower computational cost . 
Experimental results of part-of-speech tagging and text chunking show that the proposed bidirectional inference methods consistently outperform unidirectional inference methods and bidirectional MEMMs give comparable performance to that achieved by state-of-the-art learning algorithms including kernel support vector machines . ##W02-2001 Extracting The Unextractable : A Case Study On Verb-Particles . This paper proposes a series of techniques for extracting English verb-particle constructions from raw text corpora . We initially propose three basic methods , based on tagger output , chunker output and a chunk grammar , respectively , with the chunk grammar method optionally combining with an attachment resolution module to determine the syntactic structure of verb-preposition pairs in ambiguous constructs . We then combine the three methods together into a single classifier , and add in a number of extra lexical and frequentistic features , producing a final F-score of 0.865 over the WSJ . ##N07-1038 Multiple Aspect Ranking Using the Good Grief Algorithm . We address the problem of analyzing multiple related opinions in a text . For instance , in a restaurant review such opinions may include food , ambience and service . We formulate this task as a multiple aspect ranking problem , where the goal is to produce a set of numerical scores , one for each aspect . We present an algorithm that jointly learns ranking models for individual aspects by modeling the dependencies between assigned ranks . This algorithm guides the prediction of individual rankers by analyzing meta-relations between opinions , such as agreement and contrast . We prove that our agreement-based joint model is more expressive than individual ranking models . Our empirical results further confirm the strength of the model : the algorithm provides significant improvement over both individual rankers and a state-of-the-art joint ranking model . 
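The easiest-first strategy described in the H05-1059 abstract above can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's model: the lexicon, the scorer `toy_score` and the confidence rule are all invented for the example; the point is only that the most confident decision is committed first, so decided tags become context for harder positions.

```python
# Sketch (assumptions, not the H05-1059 model): easiest-first inference
# commits the single most confident tag per round; committed tags on
# either side then serve as context for the remaining positions.
def easiest_first_tag(words, score):
    """score(words, i, left_tag, right_tag) -> {tag: prob} for position i;
    left_tag / right_tag are committed neighbour tags, None if undecided."""
    tags = [None] * len(words)
    while None in tags:
        best = None                       # (prob, position, tag)
        for i, t in enumerate(tags):
            if t is not None:
                continue
            left = tags[i - 1] if i > 0 else None
            right = tags[i + 1] if i + 1 < len(tags) else None
            tag, p = max(score(words, i, left, right).items(),
                         key=lambda kv: kv[1])
            if best is None or p > best[0]:
                best = (p, i, tag)
        tags[best[1]] = best[2]           # commit the easiest decision
    return tags

# Hypothetical toy scorer: 'saw' is ambiguous on its own, but a committed
# DET to its left pushes it towards NOUN.
LEXICON = {"the": {"DET": 0.99},
           "saw": {"NOUN": 0.5, "VERB": 0.5},
           "rusted": {"VERB": 0.85, "NOUN": 0.15}}

def toy_score(words, i, left, right):
    dist = dict(LEXICON[words[i]])
    if left == "DET" and "NOUN" in dist:
        dist["NOUN"] = min(1.0, dist["NOUN"] + 0.4)
    return dist
```

Here "the" is committed first (highest confidence), after which the DET context disambiguates "saw" to NOUN; a purely left-to-right pass with no committed context would have had to guess.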
##W09-0412 NUS at WMT09 : Domain Adaptation Experiments for English-Spanish Machine Translation of News Commentary Text . We describe the system developed by the team of the National University of Singapore for English to Spanish machine translation of News Commentary text for the WMT09 Shared Translation Task . Our approach is based on domain adaptation , combining a small in-domain News Commentary bi-text and a large out-of-domain one from the Europarl corpus , from which we built and combined two separate phrase tables . We further combined two language models -LRB- in-domain and out-of-domain -RRB- , and we experimented with cognates , improved tokenization and recasing , achieving the highest lowercased NIST score of 6.963 and the second best lowercased Bleu score of 24.91 % for training without using additional external data for English-to-Spanish translation at the shared task . ##W07-2094 UPAR7 : A knowledge-based system for headline sentiment tagging . For the Affective Text task at SemEval2007 , University Paris 7 's system first evaluates emotion and valence on all words of a news headline -LRB- using enriched versions of SentiWordNet and a subset of WordNetAffect -RRB- . We use a parser to find the head word , considering that it has a major importance . We also detect contrasts -LRB- between positive and negative words -RRB- that shift valence . Our knowledge-based system achieves high accuracy on emotion and valence annotation . These results show that working with linguistic techniques and a broad-coverage lexicon is a viable approach to sentiment analysis of headlines . 1 Introduction 1.1 Objectives The detection of emotional connotations in texts is a recent task in computational linguistics . Its economic stakes are promising ; for example , a company could detect , by analyzing the blogosphere , people 's opinion on its products . 
The goal of the SemEval task is to annotate news headlines for emotions -LRB- using a predefined list : anger , disgust , fear , joy , sadness & surprise -RRB- , and for valence -LRB- positive or negative -RRB- . A specific difficulty here is related to the small number of words available for the analysis . ##P07-1024 Optimizing Grammars for Minimum Dependency Length . We examine the problem of choosing word order for a set of dependency trees so as to minimize total dependency length . We present an algorithm for computing the optimal layout of a single tree as well as a numerical method for optimizing a grammar of orderings over a set of dependency types . A grammar generated by minimizing dependency length in unordered trees from the Penn Treebank is found to agree surprisingly well with English word order , suggesting that dependency length minimization has influenced the evolution of English . ##D07-1071 Online Learning of Relaxed CCG Grammars for Parsing to Logical Form . We consider the problem of learning to parse sentences to lambda-calculus representations of their underlying semantics and present an algorithm that learns a weighted combinatory categorial grammar -LRB- CCG -RRB- . A key idea is to introduce non-standard CCG combinators that relax certain parts of the grammar -- for example allowing flexible word order , or insertion of lexical items -- with learned costs . We also present a new , online algorithm for inducing a weighted CCG . Results for the approach on ATIS data show 86 % F-measure in recovering fully correct semantic analyses and 95.9 % F-measure by a partial-match criterion , a more than 5 % improvement over the 90.3 % partial-match figure reported by He and Young -LRB- 2006 -RRB- . ##H05-1013 A Large-Scale Exploration Of Effective Global Features For A Joint Entity Detection And Tracking Model . 
Entity detection and tracking -LRB- EDT -RRB- is the task of identifying textual mentions of real-world entities in documents , extending the named entity detection and coreference resolution task by considering mentions other than names -LRB- pronouns , definite descriptions , etc. -RRB- . Like NE tagging and coreference resolution , most solutions to the EDT task separate out the mention detection aspect from the coreference aspect . By doing so , these solutions are limited to using only local features for learning . In contrast , by modeling both aspects of the EDT task simultaneously , we are able to learn using highly complex , non-local features . We develop a new joint EDT model and explore the utility of many features , demonstrating their effectiveness on this task . ##W06-2109 German Particle Verbs And Pleonastic Prepositions . This paper discusses the behavior of German particle verbs formed by two-way prepositions in combination with pleonastic PPs including the verb particle as a preposition . These particle verbs have a characteristic feature : some of them license directional prepositional phrases in the accusative , some only allow for locative PPs in the dative , and some particle verbs can occur with PPs in the accusative and in the dative . Directional particle verbs together with directional PPs present an additional problem : the particle and the preposition in the PP seem to provide redundant information . The paper gives an overview of the semantic verb classes influencing this phenomenon , based on corpus data , and explains the underlying reasons for the behavior of the particle verbs . We also show how the restrictions on particle verbs and pleonastic PPs can be expressed in a grammar theory like Lexical Functional Grammar -LRB- LFG -RRB- . ##P04-3023 On The Equivalence Of Weighted Finite-State Transducers . Although they can be topologically different , two distinct transducers may actually recognize the same rational relation . 
Being able to test the equivalence of transducers allows one to implement such operations as incremental minimization and iterative composition . This paper presents an algorithm for testing the equivalence of deterministic weighted finite-state transducers , and outlines an implementation of its applications in a prototype weighted finite-state calculus tool . ##E06-1006 Phrase-Based Backoff Models For Machine Translation Of Highly Inflected Languages . We propose a backoff model for phrase-based machine translation that translates unseen word forms in foreign-language text by hierarchical morphological abstractions at the word and the phrase level . The model is evaluated on the Europarl corpus for German-English and Finnish-English translation and shows improvements over state-of-the-art phrase-based models . ##P98-1075 Growing Semantic Grammars . A critical path in the development of natural language understanding -LRB- NLU -RRB- modules lies in the difficulty of defining a mapping from words to semantics : Usually it takes on the order of years of highly-skilled labor to develop a semantic mapping , e.g. , in the form of a semantic grammar , that is comprehensive enough for a given domain . Yet , due to the very nature of human language , such mappings invariably fail to achieve full coverage on unseen data . Acknowledging the impossibility of stating a priori all the surface forms by which a concept can be expressed , we present GSG : an empathic computer system for the rapid deployment of NLU front-ends and their dynamic customization by non-expert end-users . Given a new domain for which an NLU front-end is to be developed , two stages are involved . In the authoring stage , GSG aids the developer in the construction of a simple domain model and a kernel analysis grammar . Then , in the run-time stage , GSG provides the end-user with an interactive environment in which the kernel grammar is dynamically extended . 
Three learning methods are employed in the acquisition of semantic mappings from unseen data : -LRB- i -RRB- parser predictions , -LRB- ii -RRB- hidden understanding model , and -LRB- iii -RRB- end-user paraphrases . A baseline version of GSG has been implemented and preliminary experiments show promising results . ##A97-1011 A Non-Projective Dependency Parser . We describe a practical parser for unrestricted dependencies . The parser creates links between words and names the links according to their syntactic functions . We first describe the older Constraint Grammar parser where many of the ideas come from . Then we proceed to describe the central ideas of our new parser . Finally , the parser is evaluated . ##P97-1040 Efficient Generation In Primitive Optimality Theory . This paper introduces primitive Optimality Theory -LRB- OTP -RRB- , a linguistically motivated formalization of OT . OTP specifies the class of autosegmental representations , the universal generator Gen , and the two simple families of permissible constraints . In contrast to less restricted theories using Generalized Alignment , OTP 's optimal surface forms can be generated with finite-state methods adapted from -LRB- Ellison , 1994 -RRB- . Unfortunately these methods take time exponential on the size of the grammar . Indeed the generation problem is shown NP-complete in this sense . However , techniques are discussed for making Ellison 's approach fast in the typical case , including a simple trick that alone provides a 100-fold speedup on a grammar fragment of moderate size . One avenue for future improvements is a new finite-state notion , `` factored automata , '' where regular languages are represented compactly via formal intersections ∩Ai of FSAs . ##P08-1086 Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation . 
In statistical language modeling , one technique to reduce the problematic effects of data sparsity is to partition the vocabulary into equivalence classes . In this paper we investigate the effects of applying such a technique to higher-order n-gram models trained on large corpora . We introduce a modification of the exchange clustering algorithm with improved efficiency for certain partially class-based models and a distributed version of this algorithm to efficiently obtain automatic word classifications for large vocabularies -LRB- > 1 million words -RRB- using such large training corpora -LRB- > 30 billion tokens -RRB- . The resulting clusterings are then used in training partially class-based language models . We show that combining them with word-based n-gram models in the log-linear model of a state-of-the-art statistical machine translation system leads to improvements in translation quality as indicated by the BLEU score . ##W01-1413 Using The Web As A Bilingual Dictionary . We present a system for extracting an English translation of a given Japanese technical term by collecting and scoring translation candidates from the web . We first show that there are a lot of partially bilingual documents in the web that could be useful for term translation , discovered by using a commercial technical term dictionary and an Internet search engine . We then present an algorithm for obtaining translation candidates based on the distance of Japanese and English terms in web documents , and report the results of a preliminary experiment . ##N03-2022 Semantic Extraction With Wide-Coverage Lexical Resources . We report on results of combining graphical modeling techniques with Information Extraction resources -LRB- Pattern Dictionary and Lexicon -RRB- for both frame and semantic role assignment . Our approach demonstrates the use of two human built knowledge bases -LRB- WordNet and FrameNet -RRB- for the task of semantic extraction . 
##D09-1053 Model Adaptation via Model Interpolation and Boosting for Web Search Ranking . This paper explores two classes of model adaptation methods for Web search ranking : Model Interpolation and error-driven learning approaches based on a boosting algorithm . The results show that model interpolation , though simple , achieves the best results on all the open test sets where the test data is very different from the training data . The tree-based boosting algorithm achieves the best performance on most of the closed test sets where the test data and the training data are similar , but its performance drops significantly on the open test sets due to the instability of trees . Several methods are explored to improve the robustness of the algorithm , with limited success . ##J03-1006 Weighted Deductive Parsing And Knuth 's Algorithm . We discuss weighted deductive parsing and consider the problem of finding the derivation with the lowest weight . We show that Knuth 's generalization of Dijkstra 's algorithm for the shortest-path problem offers a general method to solve this problem . Our approach is modular in the sense that Knuth 's algorithm is formulated independently from the weighted deduction system . ##W06-1646 Corrective Models For Speech Recognition Of Inflected Languages . This paper presents a corrective model for speech recognition of inflected languages . The model , based on a discriminative framework , incorporates word n-gram features as well as factored morphological features , providing error reduction over the model based solely on word n-gram features . Experiments on a large vocabulary task , namely the Czech portion of the MALACH corpus , demonstrate performance gain of about 1.1 -- 1.5 % absolute in word error rate , wherein morphological features contribute about a third of the improvement . 
A simple feature selection mechanism based on χ2 statistics is shown to be effective in reducing the number of features by about 70 % without any loss in performance , making it feasible to explore yet larger feature spaces . ##P98-1032 Automated Scoring Using A Hybrid Feature Identification Technique . This study exploits statistical redundancy inherent in natural language to automatically predict scores for essays . We use a hybrid feature identification method , including syntactic structure analysis , rhetorical structure analysis , and topical analysis , to score essay responses from test-takers of the Graduate Management Admissions Test -LRB- GMAT -RRB- and the Test of Written English -LRB- TWE -RRB- . For each essay question , a stepwise linear regression analysis is run on a training set -LRB- sample of human scored essay responses -RRB- to extract a weighted set of predictive features for each test question . Score prediction for cross-validation sets is calculated from the set of predictive features . Exact or adjacent agreement between the Electronic Essay Rater -LRB- e-rater -RRB- score predictions and human rater scores ranged from 87 % to 94 % across the 15 test questions . ##W09-2010 An Unsupervised Model for Text Message Normalization . Cell phone text messaging users express themselves briefly and colloquially using a variety of creative forms . We analyze a sample of creative , non-standard text message word forms to determine frequent word formation processes in texting language . Drawing on these observations , we construct an unsupervised noisy-channel model for text message normalization . On a test set of 303 text message forms that differ from their standard form , our model achieves 59 % accuracy , which is on par with the best supervised results reported on this dataset . ##P09-1077 Automatic sense prediction for implicit discourse relations in text . 
We present a series of experiments on automatically identifying the sense of implicit discourse relations , i.e. relations that are not marked with a discourse connective such as `` but '' or `` because '' . We work with a corpus of implicit relations present in newspaper text and report results on a test set that is representative of the naturally occurring distribution of senses . We use several linguistically informed features , including polarity tags , Levin verb classes , length of verb phrases , modality , context , and lexical features . In addition , we revisit past approaches using lexical pairs from unannotated text as features , explain some of their shortcomings and propose modifications . Our best combination of features outperforms the baseline from data-intensive approaches by 4 % for comparison and 16 % for contingency . ##W98-0602 A Dynamic Temporal Logic Of Events , Intervals And States For Nominalization In Natural Language . The interpretation of nominalized expressions in English poses several problems . First , it must be explained how their meanings are derived from the meanings of the underlying verbs . Second , different forms of nominalizations differ in their semantic behavior . Finally , aspectual restrictions which exist for ing-of-nominals must be explained . The solution to be proposed is based on the assumption that non-stative verbs denote changes . Changes can be conceived of in two different ways , either as objects which bring about a particular result or as relations between states . A dynamic structure of events , intervals and states is defined in which both perspectives can be expressed by means of sorting the universe D. The basic idea is to augment a transition system for Dynamic Logic -LRB- DL -RRB- by a further domain of events such that programs from DL can be described either as objects or relations between states . 
The interpretation of verbs is based on the second perspective : they denote -LRB- generalized -RRB- relations between states . The interpretation of nominalized expressions uses the first perspective : they denote changes as objects . Different forms of nominalizations denote different sorts of objects which are systematically related to the denotation of the underlying verb . ##I08-2088 Method of Selecting Training Data to Build a Compact and Efficient Translation Model . Target task matched parallel corpora are required for statistical translation model training . However , training corpora sometimes include both target task matched and unmatched sentences . In such a case , training set selection can reduce the size of the translation model . In this paper , we propose a training set selection method for translation model training using linear translation model interpolation and a language model technique . According to the experimental results , the proposed method reduces the translation model size by 50 % and improves BLEU score by 1.76 % in comparison with a baseline training corpus usage . ##C04-1153 Learning Greek Verb Complements : Addressing The Class Imbalance . Imbalanced training sets , where one class is heavily underrepresented compared to the others , have a bad effect on the classification of rare class instances . We apply One-sided Sampling for the first time to a lexical acquisition task -LRB- learning verb complements from Modern Greek corpora -RRB- to remove redundant and misleading training examples of verb nondependents and thereby balance our training set . We experiment with well-known learning algorithms to classify new examples . Performance improves up to 22 % in recall and 15 % in precision after balancing the dataset . ##H93-1034 Efficient Collaborative Discourse : A Theory And Its Implementation . 
An architecture for voice dialogue machines is described with emphasis on the problem solving and high level decision making mechanisms . The architecture provides facilities for generating voice interactions aimed at cooperative human-machine problem solving . It assumes that the dialogue will consist of a series of local self-consistent subdialogues each aimed at subgoals related to the overall task . The discourse may consist of a set of such subdialogues with jumps from one subdialogue to the other in a search for a successful conclusion . The architecture maintains a user model to assure that interactions properly account for the level of competence of the user , and it includes an ability for the machine to take the initiative or yield the initiative to the user . It uses expectation from the dialogue processor to aid in the correction of errors from the speech recognizer . ##W00-0702 Corpus-Based Grammar Specialization . Broad-coverage grammars tend to be highly ambiguous . When such grammars are used in a restricted domain , it may be desirable to specialize them , in effect trading some coverage for a reduction in ambiguity . Grammar specialization is here given a novel formulation as an optimization problem , in which the search is guided by a global measure combining coverage , ambiguity and grammar size . The method , applicable to any unification grammar with a phrase-structure backbone , is shown to be effective in specializing a broad-coverage LFG for French . ##W08-0303 Discriminative Word Alignment via Alignment Matrix Modeling . In this paper a new discriminative word alignment method is presented . This approach models directly the alignment matrix by a conditional random field -LRB- CRF -RRB- and so no restrictions to the alignments have to be made . Furthermore , it is easy to add features and so all available information can be used . 
Since the structure of the CRFs can get complex , the inference can only be done approximately and the standard algorithms had to be adapted . In addition , different methods to train the model have been developed . Using this approach the alignment quality could be improved by up to 23 percent for 3 different language pairs compared to a combination of both IBM4 alignments . Furthermore the word alignment was used to generate new phrase tables . These could improve the translation quality significantly . ##N03-3006 A Low-Complexity , Broad-Coverage Probabilistic Dependency Parser For English . Large-scale parsing is still a complex and time-consuming process , often so much that it is infeasible in real-world applications . The parsing system described here addresses this problem by combining finite-state approaches , statistical parsing techniques and engineering knowledge , thus keeping parsing complexity as low as possible at the cost of a slight decrease in performance . The parser is robust and fast and at the same time based on strong linguistic foundations . ##P84-1024 Semantic Interpretation Using KL-One . This paper presents extensions to the work of Bobrow and Webber -LRB- Bobrow & Webber 80a , Bobrow & Webber 80b -RRB- on semantic interpretation using KL-ONE to represent knowledge . The approach is based on an extended case frame formalism applicable to all types of phrases , not just clauses . The frames are used to recognize semantically acceptable phrases , identify their structure , and relate them to their meaning representation through translation rules . Approaches are presented for generating KL-ONE structures as the meaning of a sentence , for capturing semantic generalizations through abstract case frames , and for handling pronouns and relative clauses . ##W06-1322 A Computational Model Of Multi-Modal Grounding For Human Robot Interaction . 
Dialog systems for mobile robots operating in the real world should enable mixed-initiative dialog style , handle multi-modal information involved in the communication and be relatively independent of the domain knowledge . Most dialog systems developed for mobile robots today , however , are often system-oriented and have limited capabilities . We present an agent-based dialog model that is specially designed for human-robot interaction and provide evidence for its efficiency with our implemented system . ##H91-1051 Context Dependent Modeling Of Phones In Continuous Speech Using Decision Trees . In a continuous speech recognition system it is important to model the context dependent variations in the pronunciations of words . In this paper we present an automatic method for modeling phonological variation using decision trees . For each phone we construct a decision tree that specifies the acoustic realization of the phone as a function of the context in which it appears . Several thousand sentences from a natural language corpus spoken by several talkers are used to construct these decision trees . Experimental results on a 5000-word vocabulary natural language speech recognition task are presented . ##P06-3012 Focus To Emphasize Tone Structures For Prosodic Analysis In Spoken Language Generation . We analyze the concept of focus in speech and the relationship between focus and speech acts for prosodic generation . We determine how the speaker 's utterances are influenced by the speaker 's intention . The relationship between speech acts and focus information is used to define which parts of the sentence serve as the focus parts . We propose the Focus to Emphasize Tones -LRB- FET -RRB- structure to analyze the focus components . We also design the FET grammar to analyze the intonation patterns and produce tone marks as a result of our analysis . We present a proof-of-the-concept working example to validate our proposal . 
More comprehensive evaluations are part of our current work . ##C90-2003 Finding Translation Equivalents : An Application Of Grammatical Metaphor . In this paper I describe how a significant class of cases that would involve -LRB- possibly complex -RRB- structural transfer in machine translation can be handled avoiding transfer . This is achieved by applying a semantic organization developed for monolingual text generation that is sufficiently abstract to remain invariant , within theoretically specifiable limits , across different languages . The further application of a mechanism motivated from within monolingual text generation , ` grammatical metaphor ' , then allows candidate appropriate translations to be isolated . The incorporation of these essentially monolingual mechanisms within the machine translation process promises to significantly improve translational capabilities ; examples of this are presented for English and German . ##P96-1022 SEMHE : A Generalised Two-Level System . This paper presents a generalised two-level implementation which can handle linear and non-linear morphological operations . An algorithm for the interpretation of multi-tape two-level rules is described . In addition , a number of issues which arise when developing non-linear grammars are discussed with examples from Syriac . ##A00-1020 Multilingual Coreference Resolution . In this paper we present a new , multilingual data-driven method for coreference resolution as implemented in the SWIZZLE system . The results obtained after training this system on a bilingual corpus of English and Romanian tagged texts outperformed coreference resolution in each of the individual languages . ##P97-1036 Unification-Based Multimodal Integration . Recent empirical research has shown conclusive advantages of multimodal interaction over speech-only interaction for map-based tasks . 
This paper describes a multimodal language processing architecture which supports interfaces allowing simultaneous input from speech and gesture recognition . Integration of spoken and gestural input is driven by unification of typed feature structures representing the semantic contributions of the different modes . This integration method allows the component modalities to mutually compensate for each other 's errors . It is implemented in QuickSet , a multimodal -LRB- pen\/voice -RRB- system that enables users to set up and control distributed interactive simulations . ##P02-1025 A Study On Richer Syntactic Dependencies For Structured Language Modeling . We study the impact of richer syntactic dependencies on the performance of the structured language model -LRB- SLM -RRB- along three dimensions : parsing accuracy -LRB- LP\/LR -RRB- , perplexity -LRB- PPL -RRB- and word-error-rate -LRB- WER , N-best re-scoring -RRB- . We show that our models achieve an improvement in LP\/LR , PPL and\/or WER over the reported baseline results using the SLM on the UPenn Treebank and Wall Street Journal -LRB- WSJ -RRB- corpora , respectively . Analysis of parsing performance shows correlation between the quality of the parser -LRB- as measured by precision\/recall -RRB- and the language model performance -LRB- PPL and WER -RRB- . A remarkable fact is that the enriched SLM outperforms the baseline 3-gram model in terms of WER by 10 % when used in isolation as a second pass -LRB- N-best re-scoring -RRB- language model . ##W05-0710 Classifying Amharic News Text Using Self-Organizing Maps . The paper addresses using artificial neural networks for classification of Amharic news items . Amharic is the language for countrywide communication in Ethiopia and has its own writing system containing extensive systematic redundancy . 
It is quite dialectally diversified and probably representative of the languages of a continent that so far has received little attention within the language processing field . The experiments investigated document clustering around user queries using Self-Organizing Maps , an unsupervised learning neural network strategy . The best ANN model showed a precision of 60.0 % when trying to cluster unseen data , and a 69.5 % precision when trying to classify it . ##H94-1083 Advanced Human-Computer Interface And Voice Processing Applications In Space . Much interest already exists in the electronics research community for developing and integrating speech technology to a variety of applications , ranging from voice-activated systems to automatic telephone transactions . This interest is particularly true in the field of aerospace where the training and operational demands on the crew have significantly increased with the proliferation of technology . Indeed , with advances in vehicle and robot automation , the role of the human operator has evolved from that of pilot\/driver and manual controller to supervisor and decision maker . Lately , some effort has been expended to implement alternative modes of system control , but automatic speech recognition -LRB- ASR -RRB- and human-computer interaction -LRB- HCI -RRB- research have only recently extended to civilian aviation and space applications . The purpose of this paper is to present the particularities of operator-computer interaction in the unique conditions found in space . The potential for voice control applications inside spacecraft is outlined and methods of integrating spoken-language interfaces onto operational space systems are suggested . ##W98-1412 Abductive Reasoning For Syntactic Realization . Abductive reasoning is used in a bidirectional framework for syntactic realization and semantic interpretation . 
The use of the framework is illustrated in a case study of sentence generation , where different syntactic forms are generated depending on the status of discourse information . Examples are given involving three different syntactic constructions in German root clauses . ##P98-1010 A Memory-Based Approach to Learning Shallow Natural Language Patterns . Recognizing shallow linguistic patterns , such as basic syntactic relationships between words , is a common task in applied natural language and text processing . The common practice for approaching this task is by tedious manual definition of possible pattern structures , often in the form of regular expressions or finite automata . This paper presents a novel memory-based learning method that recognizes shallow patterns in new text based on a bracketed training corpus . The training data are stored as-is , in efficient suffix-tree data structures . Generalization is performed on-line at recognition time by comparing subsequences of the new text to positive and negative evidence in the corpus . This way , no information in the training is lost , as can happen in other learning systems that construct a single generalized model at the time of training . The paper presents experimental results for recognizing noun phrase , subject-verb and verb-object patterns in English . Since the learning approach enables easy porting to new domains , we plan to apply it to syntactic patterns in other languages and to sub-language patterns for information extraction . ##W06-1630 Unsupervised Named Entity Transliteration Using Temporal And Phonetic Correlation . In this paper we investigate unsupervised name transliteration using comparable corpora , corpora where texts in the two languages deal in some of the same topics -- and therefore share references to named entities -- but are not translations of each other . 
We present two distinct methods for transliteration , one approach using an unsupervised phonetic transliteration method , and the other using the temporal distribution of candidate pairs . Each of these approaches works quite well , but by combining the approaches one can achieve even better results . We believe that the novelty of our approach lies in the phonetic-based scoring method , which is based on a combination of carefully crafted phonetic features , and empirical results from the pronunciation errors of second-language learners of English . Unlike previous approaches to transliteration , this method can in principle work with any pair of languages in the absence of a training dictionary , provided one has an estimate of the pronunciation of words in text . ##J98-3005 Generating Natural Language Summaries From Multiple On-Line Sources . We present a methodology for summarization of news about current events in the form of briefings that include appropriate background -LRB- historical -RRB- information . The system that we developed , SUMMONS , uses the output of systems developed for the DARPA Message Understanding Conferences to generate summaries of multiple documents on the same or related events , presenting similarities and differences , contradictions , and generalizations among sources of information . We describe the various components of the system , showing how information from multiple articles is combined , organized into a paragraph , and finally , realized as English sentences . A feature of our work is the extraction of descriptions of entities such as people and places for reuse to enhance a briefing . ##W04-1801 A Lexico-Semantic Approach To The Structuring Of Terminology . This paper discusses a number of implications of using either a conceptual approach or a lexico-semantic approach to terminology structuring , especially for interpreting data supplied by corpora for the purpose of building specialized dictionaries . 
A simple example , i.e. , program , will serve as a basis for showing how relationships between terms are captured in both approaches . My aim is to demonstrate that truly conceptual approaches do not allow a flexible integration of terms and relationships between terms and that lexico-semantic approaches are more compatible with data gathered from corpora . I will also discuss some of the implications these approaches have for computational terminology and other corpus-based terminological endeavors . ##W07-2102 UTD-SRL : A Pipeline Architecture for Extracting Frame Semantic Structures . This paper describes our system for the task of extracting frame semantic structures in SemEval 2007 . The system architecture uses two types of learning models in each part of the task : Support Vector Machines -LRB- SVM -RRB- and Maximum Entropy -LRB- ME -RRB- . Designed as a pipeline of classifiers , the semantic parsing system obtained competitive precision scores on the test data . ##A92-1029 Compound Nouns In A Unification-Based MT System . This paper describes an approach to the treatment of nominal compounds in a machine translation project employing a modern unification-based system . General problems connected with the analysis of compounds are briefly reviewed , and the project , for the automatic translation of Swiss avalanche bulletins , is introduced . Avalanche bulletins deal with a limited semantic domain and employ a sublanguage in which nominal compounds occur frequently . These and other properties of the texts affect the treatment of compounds , permitting certain simplifications , while leaving a number of possible alternative analyses . We discuss the different problems involving the translation of compounds between German and French , and show how the computational environment in use permits two different approaches to the problem : an interlingua-based approach and a transfer-based approach . 
Finally , we evaluate these approaches with respect to linguistic and computational considerations applicable in an MT-system dealing with a limited semantic domain and describe the solution that has actually been implemented . ##J93-2006 Coping With Ambiguity And Unknown Words Through Probabilistic Models . From spring 1990 through fall 1991 , we performed a battery of small experiments to test the effectiveness of supplementing knowledge-based techniques with probabilistic models . This paper reports our experiments in predicting parts of speech of highly ambiguous words , predicting the intended interpretation of an utterance when more than one interpretation satisfies all known syntactic and semantic constraints , and learning case frame information for verbs from example uses . From these experiments , we are convinced that probabilistic models based on annotated corpora can effectively reduce the ambiguity in processing text and can be used to acquire lexical information from a corpus , by supplementing knowledge-based techniques . Based on the results of those experiments , we have constructed a new natural language system -LRB- PLUM -RRB- for extracting data from text , e.g. , newswire text . ##N04-1007 Answering Definition Questions With Multiple Knowledge Sources . Definition questions represent a largely unexplored area of question answering -- they are different from factoid questions in that the goal is to return as many relevant `` nuggets '' of information about a concept as possible . We describe a multi-strategy approach to answering such questions using a database constructed offline with surface patterns , a Web-based dictionary , and an off-the-shelf document retriever . Results are presented from component-level evaluation and from an end-to-end evaluation of our implemented system at the TREC 2003 Question Answering Track . ##D09-1093 Using the Web for Language Independent Spellchecking and Autocorrection . 
We have designed , implemented and evaluated an end-to-end spellchecking and autocorrection system that does not require any manually annotated training data . The World Wide Web is used as a large noisy corpus from which we infer knowledge about misspellings and word usage . This is used to build an error model and an n-gram language model . A small secondary set of news texts with artificially inserted misspellings is used to tune confidence classifiers . Because no manual annotation is required , our system can easily be instantiated for new languages . When evaluated on human typed data with real misspellings in English and German , our web-based systems outperform baselines which use candidate corrections based on hand-curated dictionaries . Our system achieves 3.8 % total error rate in English . We show similar improvements in preliminary results on artificial data for Russian and Arabic . ##C92-2070 Word-Sense Disambiguation Using Statistical Models Of Roget 's Categories Trained On Large Corpora . This paper describes a program that disambiguates English word senses in unrestricted text using statistical models of the major Roget 's Thesaurus categories . Roget 's categories serve as approximations of conceptual classes . The categories listed for a word in Roget 's index tend to correspond to sense distinctions ; thus selecting the most likely category provides a useful level of sense disambiguation . The selection of categories is accomplished by identifying and weighting words that are indicative of each category when seen in context , using a Bayesian theoretical framework . Other statistical approaches have required special corpora or hand-labeled training examples for much of the lexicon . Our use of class models overcomes this knowledge acquisition bottleneck , enabling training on unrestricted monolingual text without human intervention . 
Applied to the 10 million word Grolier 's Encyclopedia , the system correctly disambiguated 92 % of the instances of 12 polysemous words that have been previously studied in the literature . ##W04-2403 A Semantic Kernel For Predicate Argument Classification . Automatically deriving semantic structures from text is a challenging task for machine learning . The flat feature representations , usually used in learning models , can only partially describe structured data . This makes it difficult to process the semantic information that is embedded into parse-trees . In this paper a new kernel for automatic classification of predicate arguments has been designed and experimented with . It is based on subparse-trees annotated with predicate argument information from the PropBank corpus . This kernel , exploiting the convolution properties of the parse-tree kernel , enables us to learn which syntactic structures can be associated with the arguments defined in PropBank . Support Vector Machines -LRB- SVMs -RRB- using such a kernel classify arguments with a better accuracy than SVMs based on a linear kernel . ##W06-3918 The Alligator theorem prover for dependent type systems : Description and proof samples . This paper introduces the Alligator theorem prover for Dependent Type Systems -LRB- dts -RRB- . We start with highlighting a number of properties of dts that make them specifically suited for computational semantics . We then briefly introduce dts and our implementation . The paper concludes with an example of a dts proof that illustrates the suitability of dts for modelling anaphora resolution . ##W00-1413 The Hyperonym Problem Revisited : Conceptual And Lexical Hierarchies In Language Generation . When a lexical item is selected in the language production process , it needs to be explained why none of its superordinates gets selected instead , since their applicability conditions are fulfilled all the same . 
This question has received much attention in cognitive modelling and not as much in other branches of NLG . This paper describes the various approaches taken , discusses the reasons why they are so different , and argues that production models using symbolic representations should make a distinction between conceptual and lexical hierarchies , which can be organized along fixed levels as studied in -LRB- some branches of -RRB- lexical semantics . ##J04-3002 Learning Subjective Language . Subjectivity in natural language refers to aspects of language used to express opinions , evaluations , and speculations . There are numerous natural language processing applications for which subjectivity analysis is relevant , including information extraction and text categorization . The goal of this work is learning subjective language from corpora . Clues of subjectivity are generated and tested , including low-frequency words , collocations , and adjectives and verbs identified using distributional similarity . The features are also examined working together in concert . The features , generated from different data sets using different procedures , exhibit consistency in performance in that they all do better and worse on the same data sets . In addition , this article shows that the density of subjectivity clues in the surrounding context strongly affects how likely it is that a word is subjective , and it provides the results of an annotation study assessing the subjectivity of sentences with high-density features . Finally , the clues are used to perform opinion piece recognition -LRB- a type of text categorization and genre detection -RRB- to demonstrate the utility of the knowledge acquired in this article . ##E89-1034 French Order Without Order . To account for the semi-free word order of French , Unification Categorial Grammar is extended in two ways . First , verbal valencies are contained in a set rather than in a list . 
Second , type-raised NP 's are described as two-sided functors . The new framework does not overgenerate , i.e. , it accepts all and only the sentences which are grammatical . This follows partly from the elimination of false lexical ambiguities - i.e. , ambiguities introduced in order to account for all the possible positions a word can be in within a sentence - and partly from a system of features constraining the possible combinations . ##E93-1013 LFG Semantics Via Constraints . Semantic theories of natural language associate meanings with utterances by providing meanings for lexical items and rules for determining the meaning of larger units given the meanings of their parts . Traditionally , meanings are combined via function composition , which works well when constituent structure trees are used to guide semantic composition . More recently , the functional structure of LFG has been used to provide the syntactic information necessary for constraining derivations of meaning in a cross-linguistically uniform format . It has been difficult , however , to reconcile this approach with the combination of meanings by function composition . In contrast to compositional approaches , we present a deductive approach to assembling meanings , based on reasoning with constraints , which meshes well with the unordered nature of information in the functional structure . Our use of linear logic as a ` glue ' for assembling meanings also allows for a coherent treatment of modification as well as of the LFG requirements of completeness and coherence . ##C00-2087 Interaction Grammars . Interaction Grammars -LRB- IG -RRB- are a new linguistic formalism which is based on descriptions of underspecified trees in the framework of intuitionistic linear logic -LRB- ILL -RRB- . Syntactic composition , which is expressed by deduction in linear logic , is controlled by a system of polarized features . 
In this way , parsing amounts to generating models of tree descriptions and it is implemented as a constraint satisfaction problem . ##N07-3001 Query Expansion Using Domain Information in Compounds . This paper describes a query expansion strategy for domain specific information retrieval . Components of compounds are used selectively . Only parts belonging to the same domain as the compound itself will be used in expanded queries . ##P05-3014 SenseLearner : Word Sense Disambiguation For All Words In Unrestricted Text . This paper describes SENSELEARNER -- a minimally supervised word sense disambiguation system that attempts to disambiguate all content words in a text using WordNet senses . We evaluate the accuracy of SENSELEARNER on several standard sense-annotated data sets , and show that it compares favorably with the best results reported during the recent SENSEVAL evaluations . ##W93-0227 Rhetoric As Knowledge . A proper assessment of the relation between discourse structure and speaker 's communicative intentions requires a better understanding of communicative intentions . This contribution proposes that there is a crucial difference between intending the hearer to entertain a certain belief -LRB- or desire , or intention -RRB- , and intending to affect the strength with which the hearer entertains the belief -LRB- or desire , or intention -RRB- . Rhetoric , if defined as a body of knowledge about how discourse structure affects the strength with which a discourse participant entertains beliefs , desires , and intentions , can be seen to play a precise and crucial role in the planning of discourse . ##P94-1016 Interleaving Syntax And Semantics In An Efficient Bottom-Up Parser . We describe an efficient bottom-up parser that interleaves syntactic and semantic structure building . 
Two techniques are presented for reducing search by reducing local ambiguity : Limited left-context constraints are used to reduce local syntactic ambiguity , and deferred sortal-constraint application is used to reduce local semantic ambiguity . We experimentally evaluate these techniques , and show dramatic reductions in both number of chart edges and total parsing time . The robust processing capabilities of the parser are demonstrated in its use in improving the accuracy of a speech recognizer . ##W04-0849 Class-Based Collocations For Word Sense Disambiguation . This paper describes the NMSU-Pitt-UNCA word-sense disambiguation system participating in the Senseval-3 English lexical sample task . The focus of the work is on using semantic class-based collocations to augment traditional word-based collocations . Three separate sources of word relatedness are used for these collocations : 1 -RRB- WordNet hypernym relations ; 2 -RRB- cluster-based word similarity classes ; and 3 -RRB- dictionary definition analysis . ##I08-4015 The Character-based CRF Segmenter of MSRA&NEU for the 4th Bakeoff . This paper describes the Chinese Word Segmenter for the fourth International Chinese Language Processing Bakeoff . Based on the Conditional Random Field -LRB- CRF -RRB- model , a basic segmenter is designed as a problem of character-based tagging . To further improve the performance of our segmenter , we employ a word-based approach to increase the in-vocabulary -LRB- IV -RRB- word recall and a post-processing step to increase the out-of-vocabulary -LRB- OOV -RRB- word recall . We participated in the word segmentation closed test on all five corpora and our system achieved four second-best results and one fifth-best result across the five corpora . ##N09-1018 Jointly Identifying Predicates , Arguments and Senses using Markov Logic . 
In this paper we present a Markov Logic Network for Semantic Role Labelling that jointly performs predicate identification , frame disambiguation , argument identification and argument classification for all predicates in a sentence . Empirically we find that our approach is competitive : our best model would appear on par with the best entry in the CoNLL 2008 shared task open track , and at the 4th place of the closed track -- right behind the systems that use significantly better parsers to generate their input features . Moreover , we observe that by fully capturing the complete SRL pipeline in a single probabilistic model we can achieve significant improvements over more isolated systems , in particular for out-of-domain data . Finally , we show that despite the joint approach , our system is still efficient . ##D09-1068 Improving Web Search Relevance with Semantic Features . Most existing information retrieval -LRB- IR -RRB- systems do not take much advantage of natural language processing -LRB- NLP -RRB- techniques due to the complexity and limited observed effectiveness of applying NLP to IR . In this paper , we demonstrate that substantial gains can be obtained over a strong baseline using NLP techniques , if properly handled . We propose a framework for deriving semantic text matching features from named entities identified in Web queries ; we then utilize these features in a supervised machine-learned ranking approach , applying a set of emerging machine learning techniques . Our approach is especially useful for queries that contain multiple types of concepts . Compared to a major commercial Web search engine , we observe a substantial 4 % DCG5 gain over the affected queries . ##P09-4005 A Web-Based Interactive Computer Aided Translation Tool . 
We developed caitra , a novel tool that aids human translators by -LRB- a -RRB- making suggestions for sentence completion in an interactive machine translation setting , -LRB- b -RRB- providing alternative word and phrase translations , and -LRB- c -RRB- allowing them to post-edit machine translation output . The tool uses the Moses decoder , is implemented in Ruby on Rails and C++ and delivered over the web . ##W03-1901 Outline Of The International Standard Linguistic Annotation Framework . This paper describes the outline of a linguistic annotation framework under development by ISO TC37 SC WG1-1 . This international standard provides an architecture for the creation , annotation , and manipulation of linguistic resources and processing software . The goal is to provide maximum flexibility for encoders and annotators , while at the same time enabling interchange and re-use of annotated linguistic resources . We describe here the outline of the standard for the purposes of enabling annotators to begin to explore how their schemes may map into the framework . ##N07-1035 Estimating the Reliability of MDP Policies : a Confidence Interval Approach . Past approaches for using reinforcement learning to derive dialog control policies have assumed that there was enough collected data to derive a reliable policy . In this paper we present a methodology for numerically constructing confidence intervals for the expected cumulative reward for a learned policy . These intervals are used to -LRB- 1 -RRB- better assess the reliability of the expected cumulative reward , and -LRB- 2 -RRB- perform a refined comparison between policies derived from different Markov Decision Processes -LRB- MDP -RRB- models . We applied this methodology to a prior experiment where the goal was to select the best features to include in the MDP state space . 
Our results show that while some of the policies developed in the prior work exhibited very large confidence intervals , the policy developed from the best feature set had a much smaller confidence interval and thus showed very high reliability . ##W03-1010 A Plethora Of Methods For Learning English Countability . This paper compares a range of methods for classifying words based on linguistic diagnostics , focusing on the task of learning countabilities for English nouns . We propose two basic approaches to feature representation : distribution-based representation , which simply looks at the distribution of features in the corpus data , and agreement-based representation which analyses the level of tokenwise agreement between multiple preprocessor systems . We additionally compare a single multiclass classifier architecture with a suite of binary classifiers , and combine analyses from multiple preprocessors . Finally , we present and evaluate a feature selection method . ##H05-1113 Measuring The Relative Compositionality Of Verb-Noun -LRB- V-N -RRB- Collocations By Integrating Features . Measuring the relative compositionality of Multi-word Expressions -LRB- MWEs -RRB- is crucial to Natural Language Processing . Various collocation based measures have been proposed to compute the relative compositionality of MWEs . In this paper , we define novel measures -LRB- both collocation based and context based measures -RRB- to measure the relative compositionality of MWEs of V-N type . We show that the correlation of these features with the human ranking is much superior to the correlation of the traditional features with the human ranking . We then integrate the proposed features and the traditional features using an SVM based ranking function to rank the collocations of V-N type based on their relative compositionality . 
We then show that the correlation between the ranks computed by the SVM based ranking function and human ranking is significantly better than the correlation between ranking of individual features and human ranking . ##W05-1518 Improving Parsing Accuracy By Combining Diverse Dependency Parsers . This paper explores the possibilities of improving parsing results by combining outputs of several parsers . To some extent , we are porting the ideas of Henderson and Brill -LRB- 1999 -RRB- to the world of dependency structures . We differ from them in exploring context features more deeply . All our experiments were conducted on Czech but the method is language-independent . We were able to significantly improve over the best parsing result for the given setting , known so far . Moreover , our experiments show that even parsers far below the state of the art can contribute to the total improvement . ##P01-1012 Detecting Problematic Turns In Human-Machine Interactions : Rule-Induction Versus Memory-Based Learning Approaches . We address the issue of on-line detection of communication problems in spoken dialogue systems . We investigate the usefulness of the sequence of system question types and the word graphs corresponding to the respective user utterances . By applying both rule-induction and memory-based learning techniques to data obtained with a Dutch train time-table information system , the current paper demonstrates that the aforementioned features indeed lead to a method for problem detection that performs significantly above baseline . The results are interesting from a dialogue perspective since they employ features that are present in the majority of spoken dialogue systems and can be obtained with little or no computational overhead . 
The results are interesting from a machine learning perspective , since they show that the rule-based method performs significantly better than the memory-based method , because the former is better capable of representing interactions between features . ##E87-1020 REFTEX - A Context-Based Translation Aid . The system presented in this paper produces bilingual passages of text from an original -LRB- source -RRB- text and one -LRB- or more -RRB- of its translated versions . The source text passage includes words or word compounds which a translator wants to retrieve for the current translating of another text . The target text passage is the equivalent version of the source text passage . On the basis of a comparison of the contexts of these words in the concorded passage and his own text , the translator has to decide on the utility of the translation proposed in the target text passage . The program might become a component of a translator 's work bench . ##W07-0204 Timestamped Graphs : Evolutionary Models of Text for Multi-Document Summarization . Current graph-based approaches to automatic text summarization , such as LexRank and TextRank , assume a static graph which does not model how the input texts emerge . A suitable evolutionary text graph model may impart a better understanding of the texts and improve the summarization process . We propose a timestamped graph -LRB- TSG -RRB- model that is motivated by human writing and reading processes , and show how text units in this model emerge over time . In our model , the graphs used by LexRank and TextRank are specific instances of our timestamped graph with particular parameter settings . We apply timestamped graphs on the standard DUC multi-document text summarization task and achieve comparable results to the state of the art . ##N04-1002 Cross-Document Coreference On A Large Scale Corpus . 
In this paper , we will compare and evaluate the effectiveness of different statistical methods in the task of cross-document coreference resolution . We created entity models for different test sets and compare the following disambiguation and clustering techniques to cluster the entity models in order to create coreference chains : Incremental Vector Space , KL-Divergence , and Agglomerative Vector Space . ##D07-1107 Learning to Merge Word Senses . It has been widely observed that different NLP applications require different sense granularities in order to best exploit word sense distinctions , and that for many applications WordNet senses are too fine-grained . In contrast to previously proposed automatic methods for sense clustering , we formulate sense merging as a supervised learning problem , exploiting human-labeled sense clusterings as training data . We train a discriminative classifier over a wide variety of features derived from WordNet structure , corpus-based evidence , and evidence from other lexical resources . Our learned similarity measure outperforms previously proposed automatic methods for sense clustering on the task of predicting human sense merging judgments , yielding an absolute F-score improvement of 4.1 % on nouns , 13.6 % on verbs , and 4.0 % on adjectives . Finally , we propose a model for clustering sense taxonomies using the outputs of our classifier , and we make available several automatically sense-clustered WordNets of various sense granularities . ##W04-2705 The NomBank Project : An Interim Report . This paper describes NomBank , a project that will provide argument structure for instances of common nouns in the Penn Treebank II corpus . NomBank is part of a larger effort to add additional layers of annotation to the Penn Treebank II corpus . The University of Pennsylvania 's PropBank , NomBank and other annotation projects taken together should lead to the creation of better tools for the automatic analysis of text . 
This paper describes the NomBank project in detail including its specifications and the process involved in creating the resource . ##P99-1010 Supervised Grammar Induction Using Training Data With Limited Constituent Information . Corpus-based grammar induction generally relies on hand-parsed training data to learn the structure of the language . Unfortunately , the cost of building large annotated corpora is prohibitively expensive . This work aims to improve the induction strategy when there are few labels in the training data . We show that the most informative linguistic constituents are the higher nodes in the parse trees , typically denoting complex noun phrases and sentential clauses . They account for only 20 % of all constituents . For inducing grammars from sparsely labeled training data -LRB- e.g. , only higher-level constituent labels -RRB- , we propose an adaptation strategy , which produces grammars that parse almost as well as grammars induced from fully labeled corpora . Our results suggest that for a partial parser to replace human annotators , it must be able to automatically extract higher-level constituents rather than base noun phrases . ##W03-0613 Learning Word Meanings And Descriptive Parameter Spaces From Music . The audio bitstream in music encodes a high amount of statistical , acoustic , emotional and cultural information . But music also has an important linguistic accessory ; most musical artists are described in great detail in record reviews , fan sites and news items . We highlight current and ongoing research into extracting relevant features from audio and simultaneously learning language features linked to the music . We show results in a `` query-by-description '' task in which we learn the perceptual meaning of automatically-discovered single-term descriptive components , as well as a method of automatically uncovering ` semantically attached ' terms -LRB- terms that have perceptual grounding . 
-RRB- We then show recent work in ` semantic basis functions ' -- parameter spaces of description -LRB- such as fast ... slow or male ... female -RRB- that encode the highest descriptive variance in a semantic space . ##P06-2029 The Benefit Of Stochastic PP Attachment To A Rule-Based Parser . To study PP attachment disambiguation as a benchmark for empirical methods in natural language processing it has often been reduced to a binary decision problem -LRB- between verb or noun attachment -RRB- in a particular syntactic configuration . A parser , however , must solve the more general task of deciding between more than two alternatives in many different contexts . We combine the attachment predictions made by a simple model of lexical attraction with a full-fledged parser of German to determine the actual benefit of the subtask to parsing . We show that the combination of data-driven and rule-based components can reduce the number of all parsing errors by 14 % and raise the attachment accuracy for dependency parsing of German to an unprecedented 92 % . ##P87-1027 The Derivation Of A Grammatically Indexed Lexicon From The Longman Dictionary Of Contemporary English . We describe a methodology and associated software system for the construction of a large lexicon from an existing machine-readable -LRB- published -RRB- dictionary . The lexicon serves as a component of an English morphological and syntactic analyser and contains entries with grammatical definitions compatible with the word and sentence grammar employed by the analyser . We describe a software system with two integrated components . One of these is capable of extracting syntactically rich , theory-neutral lexical templates from a suitable machine-readable source . 
The second supports interactive and semi-automatic generation and testing of target lexical entries in order to derive a sizeable , accurate and consistent lexicon from the source dictionary which contains partial -LRB- and occasionally inaccurate -RRB- information . Finally , we evaluate the utility of the Longman Dictionary of Contemporary English as a suitable source dictionary for the target lexicon . ##C96-2182 Formal Description Of Multi-Word Lexemes With The Finite-State Formalism IDAREX . Most multi-word lexemes -LRB- MWLs -RRB- allow certain types of variation . This has to be taken into account for their description and their recognition in texts . We suggest describing their syntactic restrictions and their idiosyncratic peculiarities with local grammar rules , which at the same time allow us to express in a general way regularities valid for a whole class of MWLs . The local grammars can be written in a very convenient and compact way as regular expressions in the formalism IDAREX which uses a two-level morphology . IDAREX allows us to define various types of variables , and to mix canonical and inflected word forms in the regular expressions . ##C08-1044 Modeling Chinese Documents with Topical Word-Character Models . As Chinese text is written without word boundaries , effectively recognizing Chinese words is like recognizing collocations in English , substituting characters for words and words for collocations . However , existing topical models that involve collocations have a common limitation . Instead of directly assigning a topic to a collocation , they take the topic of a word within the collocation as the topic of the whole collocation . This is unsatisfactory for topical modeling of Chinese documents . Thus , we propose a topical word-character model -LRB- TWC -RRB- , which allows two distinct types of topics : word topic and character topic . 
We evaluated TWC both qualitatively and quantitatively to show that it is a powerful and promising topic model . ##W05-0803 Parsing Word-Aligned Parallel Corpora In A Grammar Induction Context . We present an Earley-style dynamic programming algorithm for parsing sentence pairs from a parallel corpus simultaneously , building up two phrase structure trees and a correspondence mapping between the nodes . The intended use of the algorithm is in bootstrapping grammars for less studied languages by using implicit grammatical information in parallel corpora . Therefore , we presuppose a given -LRB- statistical -RRB- word alignment underlying the synchronous parsing task ; this leads to a significant reduction of the parsing complexity . The theoretical complexity results are corroborated by a quantitative evaluation in which we ran an implementation of the algorithm on a suite of test sentences from the Europarl parallel corpus . ##W00-1416 On Identifying Sets . A range of research has explored the problem of generating referring expressions that uniquely identify a single entity from the shared context . But what about expressions that identify sets of entities ? In this paper , I adapt recent semantic research on plural descriptions -- using covers to abstract collective and distributive readings and using sets of assignments to represent dependencies among references -- to describe a search problem for set-identifying expressions that largely mirrors the search problem for singular referring expressions . By structuring the search space only in terms of the words that can be added to the description , the proposal defuses potential combinatorial explosions that might otherwise arise with reference to sets . ##W07-2053 NUS-ML : Improving Word Sense Disambiguation Using Topic Features . 
We participated in SemEval-1 English coarse-grained all-words task -LRB- task 7 -RRB- , English fine-grained all-words task -LRB- task 17 , subtask 3 -RRB- and English coarse-grained lexical sample task -LRB- task 17 , subtask 1 -RRB- . The same method with different labeled data is used for the tasks ; SemCor is the labeled corpus used to train our system for the all-words tasks while the labeled corpus that is provided is used for the lexical sample task . The knowledge sources include part-of-speech of neighboring words , single words in the surrounding context , local collocations , and syntactic patterns . In addition , we constructed a topic feature , targeted to capture the global context information , using the latent Dirichlet allocation -LRB- LDA -RRB- algorithm with an unlabeled corpus . A modified naïve Bayes classifier is constructed to incorporate all the features . We achieved 81.6 % , 57.6 % , 88.7 % for coarse-grained all-words task , fine-grained all-words task and coarse-grained lexical sample task respectively . ##W04-0202 COOPML : Towards Annotating Cooperative Discourse . In this paper , we present a preliminary version of COOPML , a language designed for annotating cooperative discourse . We investigate the different linguistic marks that identify and characterize the different forms of cooperativity found in written texts from FAQs , Forums and emails . ##C96-2129 Automatic Detection Of Omissions In Translations . ADOMIT is an algorithm for Automatic Detection of OMissions in Translations . The algorithm relies solely on geometric analysis of bitext maps and uses no linguistic information . This property allows it to deal equally well with omissions that do not correspond to linguistic units , such as might result from word-processing mishaps . ADOMIT has proven itself by discovering many errors in a hand-constructed gold standard for evaluating bitext mapping algorithms . 
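The NUS-ML entry above -LRB- W07-2053 -RRB- folds its knowledge sources into a modified naïve Bayes classifier . A minimal sketch of the underlying classifier only , not the paper's system : the feature extraction, add-one smoothing, and the toy sense-labeled data below are all illustrative assumptions .

```python
from collections import Counter, defaultdict
import math

def train_nb(examples):
    """examples: list of (feature_list, sense). Returns per-sense
    log-priors and add-one-smoothed feature log-likelihoods."""
    sense_counts = Counter(sense for _, sense in examples)
    feat_counts = defaultdict(Counter)
    vocab = set()
    for feats, sense in examples:
        feat_counts[sense].update(feats)
        vocab.update(feats)
    total = sum(sense_counts.values())
    model = {}
    for sense, n in sense_counts.items():
        denom = sum(feat_counts[sense].values()) + len(vocab)
        model[sense] = (
            math.log(n / total),                                          # log-prior
            {f: math.log((feat_counts[sense][f] + 1) / denom) for f in vocab},
            math.log(1 / denom),                                          # unseen-feature fallback
        )
    return model

def classify(model, feats):
    """Pick the sense maximizing log-prior + summed feature log-likelihoods."""
    def score(sense):
        prior, loglik, unk = model[sense]
        return prior + sum(loglik.get(f, unk) for f in feats)
    return max(model, key=score)
```

A usage sketch on invented data : training on context features of two senses of `bank` and classifying a new context by its bag of features .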
Quantitative evaluation on simulated omissions showed that , even with today 's poor bitext mapping technology , ADOMIT is a valuable quality control tool for translators and translation bureaus . ##N07-2030 On using Articulatory Features for Discriminative Speaker Adaptation . This paper presents a way to perform speaker adaptation for automatic speech recognition using the stream weights in a multi-stream setup , which included acoustic models for `` Articulatory Features '' such as ROUNDED or VOICED . We present supervised speaker adaptation experiments on a spontaneous speech task and compare the above stream-based approach to conventional approaches , in which the models , and not stream combination weights , are being adapted . In the approach we present , stream weights model the importance of features such as VOICED for word discrimination , which offers a descriptive interpretation of the adaptation parameters . ##P06-2022 Automatically Extracting Nominal Mentions Of Events With A Bootstrapped Probabilistic Classifier . Most approaches to event extraction focus on mentions anchored in verbs . However , many mentions of events surface as noun phrases . Detecting them can increase the recall of event extraction and provide the foundation for detecting relations between events . This paper describes a weakly supervised method for detecting nominal event mentions that combines techniques from word sense disambiguation -LRB- WSD -RRB- and lexical acquisition to create a classifier that labels noun phrases as denoting events or non-events . The classifier uses bootstrapped probabilistic generative models of the contexts of events and non-events . The contexts are the lexically-anchored semantic dependency relations that the NPs appear in . Our method dramatically improves with bootstrapping , and comfortably outperforms lexical lookup methods which are based on very much larger handcrafted resources . ##P84-1093 Machine-Readable Dictionaries . 
The papers in this panel consider machine-readable dictionaries from several perspectives : research in computational linguistics and computational lexicology , the development of tools for improving accessibility , the design of lexical reference systems for educational purposes , and applications of machine-readable dictionaries in information science contexts . As background and by way of introduction , a description is provided of a workshop on machine-readable dictionaries that was held at SRI International in April 1983 . ##W98-1111 Language Identification With Confidence Limits . A statistical classification algorithm and its application to language identification from noisy input are described . The main innovation is to compute confidence limits on the classification , so that the algorithm terminates when enough evidence to make a clear decision has been made , and so avoiding problems with categories that have similar characteristics . A second application , to genre identification , is briefly examined . The results show that some of the problems of other language identification techniques can be avoided , and illustrate a more important point : that a statistical language process can be used to provide feedback about its own success rate . ##W03-0404 Learning Subjective Nouns Using Extraction Pattern Bootstrapping . We explore the idea of creating a subjectivity classifier that uses lists of subjective nouns learned by bootstrapping algorithms . The goal of our research is to develop a system that can distinguish subjective sentences from objective sentences . First , we use two bootstrapping algorithms that exploit extraction patterns to learn sets of subjective nouns . Then we train a Naive Bayes classifier using the subjective nouns , discourse features , and subjectivity clues identified in prior research . 
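The language identifier above -LRB- W98-1111 -RRB- accumulates statistical evidence and terminates once a clear decision is possible . A toy sketch of that idea only : character-bigram models and a fixed log-score margin as the stopping criterion are illustrative assumptions , not the paper's actual classifier or its confidence-limit computation .

```python
import math
from collections import Counter

def bigram_model(text):
    """Add-one-smoothed character-bigram log-probabilities from sample text."""
    pairs = Counter(zip(text, text[1:]))
    total = sum(pairs.values())
    vocab = len(pairs) + 1
    return {
        "pairs": {p: math.log((c + 1) / (total + vocab)) for p, c in pairs.items()},
        "unk": math.log(1 / (total + vocab)),  # fallback for unseen bigrams
    }

def identify(models, text, margin=5.0):
    """Accumulate evidence bigram by bigram; stop early once the best
    hypothesis leads the runner-up by `margin` log-units."""
    scores = {lang: 0.0 for lang in models}
    for pair in zip(text, text[1:]):
        for lang, m in models.items():
            scores[lang] += m["pairs"].get(pair, m["unk"])
        ranked = sorted(scores.values(), reverse=True)
        if len(ranked) > 1 and ranked[0] - ranked[1] >= margin:
            break  # confident enough; no need to read the rest
    return max(scores, key=scores.get)
```

The early break is the point of the sketch : on easy inputs the decision is made after a few characters , mirroring the paper's claim that confidence limits let the algorithm terminate as soon as the evidence is clear .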
The bootstrapping algorithms learned over 1000 subjective nouns , and the subjectivity classifier performed well , achieving 77 % recall with 81 % precision . ##P08-2063 Choosing Sense Distinctions for WSD : Psycholinguistic Evidence . Supervised word sense disambiguation requires training corpora that have been tagged with word senses , which begs the question of which word senses to tag with . The default choice has been WordNet , with its broad coverage and easy accessibility . However , concerns have been raised about the appropriateness of its fine-grained word senses for WSD . WSD systems have been far more successful in distinguishing coarse-grained senses than fine-grained ones -LRB- Navigli , 2006 -RRB- , but does that approach neglect necessary meaning differences ? Recent psycholinguistic evidence seems to indicate that closely related word senses may be represented in the mental lexicon much like a single sense , whereas distantly related senses may be represented more like discrete entities . These results suggest that , for the purposes of WSD , closely related word senses can be clustered together into a more general sense with little meaning loss . The current paper will describe this psycholinguistic research and its implications for automatic word sense disambiguation . ##W00-1212 A Block-Based Robust Dependency Parser For Unrestricted Chinese Text . Although substantial efforts have been made to parse Chinese , very few have been practically used due to their incapability of handling unrestricted texts . This paper realizes a practical system for Chinese parsing by using a hybrid model of phrase structure partial parsing and dependency parsing . This system showed good performance and high robustness in parsing unrestricted texts and has been applied in a successful machine translation product . ##W04-2804 Ends-Based Dialogue Processing . We describe a reusable and scalable dialogue toolbox and its application in multiple systems . 
Our main claim is that ends-based representation and processing throughout the complete dialogue backbone is essential to our approach . ##I05-1049 Relative Compositionality of Multi-word Expressions : A Study of Verb-Noun -LRB- V-N -RRB- Collocations . Recognition of Multi-word Expressions -LRB- MWEs -RRB- and their relative compositionality are crucial to Natural Language Processing . Various statistical techniques have been proposed to recognize MWEs . In this paper , we integrate all the existing statistical features and investigate a range of classifiers for their suitability for recognizing the non-compositional Verb-Noun -LRB- V-N -RRB- collocations . In the task of ranking the V-N collocations based on their relative compositionality , we show that the correlation between the ranks computed by the classifier and human ranking is significantly better than the correlation between ranking of individual features and human ranking . We also show that the properties ` Distributed frequency of object ' -LRB- as defined in -LRB- 27 -RRB- -RRB- and ` Nearest Mutual Information ' -LRB- as adapted from -LRB- 18 -RRB- -RRB- contribute greatly to the recognition of the non-compositional MWEs of the V-N type and to the ranking of the V-N collocations based on their relative compositionality . ##W05-1507 Machine Translation As Lexicalized Parsing With Hooks . We adapt the `` hook '' trick for speeding up bilexical parsing to the decoding problem for machine translation models that are based on combining a synchronous context free grammar as the translation model with an n-gram language model . This dynamic programming technique yields lower complexity algorithms than have previously been described for an important class of translation models . ##T78-1031 Path-Based And Node-Based Inference In Semantic Networks . Two styles of performing inference in semantic networks are presented and compared . 
Path-based inference allows an arc or a path of arcs between two given nodes to be inferred from the existence of another specified path between the same two nodes . Path-based inference rules may be written using a binary relational calculus notation . Node-based inference allows a structure of nodes to be inferred from the existence of an instance of a pattern of node structures . Node-based inference rules can be constructed in a semantic network using a variant of a predicate calculus notation . Path-based inference is more efficient , while node-based inference is more general . A method is described of combining the two styles in a single system in order to take advantage of the strengths of each . Applications of path-based inference rules to the representation of the extensional equivalence of intensional concepts , and to the explication of inheritance in hierarchies are sketched . ##D08-1099 Automatic Set Expansion for List Question Answering . This paper explores the use of set expansion -LRB- SE -RRB- to improve question answering -LRB- QA -RRB- when the expected answer is a list of entities belonging to a certain class . Given a small set of seeds , SE algorithms mine textual resources to produce an extended list including additional members of the class represented by the seeds . We explore the hypothesis that a noise-resistant SE algorithm can be used to extend candidate answers produced by a QA system and generate a new list of answers that is better than the original list produced by the QA system . We further introduce a hybrid approach which combines the original answers from the QA system with the output from the SE algorithm . Experimental results for several state-of-the-art QA systems show that the hybrid system performs better than the QA systems alone when tested on list question data from past TREC evaluations . ##W97-0901 Reuse Of A Proper Noun Recognition System In Commercial And Operational NLP Applications . 
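Path-based inference as described above -LRB- T78-1031 -RRB- licenses a new arc between two nodes whenever a specified path of arcs already connects them . A minimal sketch over labeled triples : the graph, arc labels, and rules below are invented for illustration and are not from the paper, which uses a binary relational calculus notation .

```python
def infer_arc(graph, rule_path, new_label):
    """graph: set of (src, label, dst) triples.
    Whenever some path whose labels spell out rule_path leads from x to y,
    add the inferred arc (x, new_label, y). Returns the augmented graph."""
    def ends(node, labels):
        # All nodes reachable from `node` along a path matching `labels`.
        if not labels:
            return {node}
        out = set()
        for src, lab, dst in graph:
            if src == node and lab == labels[0]:
                out |= ends(dst, labels[1:])
        return out
    inferred = set()
    for start in {s for s, _, _ in graph}:
        for end in ends(start, rule_path):
            inferred.add((start, new_label, end))
    return graph | inferred
```

For example , a rule `["isa", "subclass"] => "isa"` encodes inheritance up a hierarchy , and `["subclass", "subclass"] => "subclass"` encodes transitivity , matching the paper's point that path-based rules explicate inheritance in hierarchies .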
SRA 's proprietary product , NameTag TM , which provides fast and accurate name recognition , has been reused in many applications in recent and ongoing efforts , including multilingual information retrieval and browsing , text clustering , and assistance to manual text indexing . This paper reports on SRA 's experience in embedding name recognition in these three specific applications , and the mutual impacts that occur , both on the algorithmic level and in the role that name recognition plays in user interaction with a system . In the course of this , we touch upon various interactions between proper name recognition and machine translation -LRB- MT -RRB- , as well as the role of accurate name recognition in improving the performance of word segmentation algorithms needed for languages whose writing systems do not segment words . ##W08-1133 Referring Expression Generation Using Speaker-based Attribute Selection and Trainable Realization -LRB- ATTR -RRB- . In the first REG competition , researchers proposed several general-purpose algorithms for attribute selection for referring expression generation . However , most of this work did not take into account : a -RRB- stylistic differences between speakers ; or b -RRB- trainable surface realization approaches that combine semantic and word order information . In this paper we describe and evaluate several end-to-end referring expression generation algorithms that take into consideration speaker style and use data-driven surface realization techniques . ##P06-1053 Integrating Syntactic Priming Into An Incremental Probabilistic Parser, With An Application To Psycholinguistic Modeling . The psycholinguistic literature provides evidence for syntactic priming , i.e. , the tendency to repeat structures . This paper describes a method for incorporating priming into an incremental probabilistic parser . 
Three models are compared , which involve priming of rules between sentences , within sentences , and within coordinate structures . These models simulate the reading time advantage for parallel structures found in human data , and also yield a small increase in overall parsing accuracy . ##W07-1429 Biology Based Alignments of Paraphrases for Sentence Compression . In this paper , we present a study for extracting and aligning paraphrases in the context of Sentence Compression . First , we justify the application of a new measure for the automatic extraction of paraphrase corpora . Second , we discuss the work done by -LRB- Barzilay & Lee , 2003 -RRB- who use clustering of paraphrases to induce rewriting rules . We will see , through classical visualization methodologies -LRB- Kruskal & Wish , 1977 -RRB- and exhaustive experiments , that clustering may not be the best approach for automatic pattern identification . Finally , we will provide some results of different biology based methodologies for pairwise paraphrase alignment . Sentence Compression can be seen as the removal of redundant words or phrases from an input sentence by creating a new sentence in which the gist of the original meaning of the sentence remains unchanged . Sentence Compression takes an important place for Natural Language Processing -LRB- NLP -RRB- tasks where specific constraints must be satisfied , such as length in summarization -LRB- Barzilay & Lee , 2002 ; Knight & Marcu , 2002 ; Shinyama et al. , 2002 ; Barzilay & Lee , 2003 ; Le Nguyen & Ho , 2004 ; Unno et al. , 2006 -RRB- , style in text simplification -LRB- Marsi & Krahmer , 2005 -RRB- or sentence simplification for subtitling -LRB- Daelemans et al. , 2004 -RRB- . ##W01-1006 Semi-Automatic Practical Ontology Construction By Using A Thesaurus , Computational Dictionaries , And Large Corpora . This paper presents the semi-automatic construction method of a practical ontology by using various resources . 
In order to acquire a reasonably practical ontology in a limited time and with less manpower , we extend the Kadokawa thesaurus by inserting additional semantic relations into its hierarchy , which are classified as case relations and other semantic relations . The former can be obtained by converting valency information and case frames from previously-built computational dictionaries used in machine translation . The latter can be acquired from concept co-occurrence information , which is extracted automatically from large corpora . The ontology stores rich semantic constraints among 1,110 concepts , and enables a natural language processing system to resolve semantic ambiguities by making inferences with the concept network of the ontology . In our practical machine translation system , our ontology-based word sense disambiguation method achieved an 8.7 % improvement over methods which do not use an ontology for Korean translation . ##W09-0407 The RWTH System Combination System for WMT 2009 . RWTH participated in the System Combination task of the Fourth Workshop on Statistical Machine Translation -LRB- WMT 2009 -RRB- . Hypotheses from 9 German → English MT systems were combined into a consensus translation . This consensus translation scored 2.1 % better in BLEU and 2.3 % better in TER -LRB- abs . -RRB- than the best single system . In addition , cross-lingual output from 10 French , German , and Spanish → English systems was combined into a consensus translation , which gave an improvement of 2.0 % in BLEU \/ 3.5 % in TER -LRB- abs . -RRB- over the best single system . ##P08-2037 Event Matching Using the Transitive Closure of Dependency Relations . This paper describes a novel event-matching strategy using features obtained from the transitive closure of dependency relations . The method yields a model capable of matching events with an F-measure of 66.5 % . ##W99-0606 Boosting Applied To Tagging And PP Attachment . 
Boosting is a machine learning algorithm that is not well known in computational linguistics . We apply it to part-of-speech tagging and prepositional phrase attachment . Performance is very encouraging . We also show how to improve data quality by using boosting to identify annotation errors . ##I08-5013 Named Entity Recognition for South Asian Languages . Much work has already been done on building named entity recognition systems . However most of this work has been concentrated on English and other European languages . Hence , building a named entity recognition -LRB- NER -RRB- system for South Asian Languages -LRB- SAL -RRB- is still an open problem because they exhibit characteristics different from English . This paper builds a named entity recognizer which also identifies nested named entities for the Hindi language using a machine learning algorithm , trained on an annotated corpus . However , the algorithm is designed in such a manner that it can easily be ported to other South Asian Languages provided the necessary NLP tools like POS tagger and chunker are available for that language . I compare results of Hindi data with English data of CONLL shared task of 2003 . ##J87-3002 Large Lexicons For Natural Language Processing : Utilising The Grammar Coding System Of LDOCE . This article focusses on the derivation of large lexicons for natural language processing . We describe the development of a dictionary support environment linking a restructured version of the Longman Dictionary of Contemporary English to natural language processing systems . The process of restructuring the information in the machine readable version of the dictionary is discussed . The Longman grammar code system is used to construct ` theory neutral ' lexical entries . We demonstrate how such lexical entries can be put to practical use by linking up the system described here with the experimental PATR-II grammar development environment . 
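Boosting , as applied in W99-0606 above , trains a sequence of weak learners on reweighted data so that later learners concentrate on previously misclassified examples . A generic AdaBoost sketch with single-feature decision stumps : this is a toy stand-in for the idea , not the paper's tagger or its feature set , and the data in the usage example is invented .

```python
import math

def adaboost(examples, rounds):
    """examples: list of (feature_set, label) with labels in {-1, +1}.
    Weak learners test presence/absence of a single feature."""
    feats = sorted({f for fs, _ in examples for f in fs})
    w = [1.0 / len(examples)] * len(examples)
    ensemble = []  # list of (alpha, feature, sign)
    for _ in range(rounds):
        best = None
        for f in feats:
            for sign in (1, -1):
                err = sum(wi for wi, (fs, y) in zip(w, examples)
                          if (sign if f in fs else -sign) != y)
                if best is None or err < best[0]:
                    best = (err, f, sign)
        err, f, sign = best
        err = min(max(err, 1e-10), 1 - 1e-10)      # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)    # weight of this weak learner
        ensemble.append((alpha, f, sign))
        # Reweight: misclassified examples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * y * (sign if f in fs else -sign))
             for wi, (fs, y) in zip(w, examples)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, fs):
    """Weighted vote of all weak learners."""
    s = sum(alpha * (sign if f in fs else -sign) for alpha, f, sign in ensemble)
    return 1 if s >= 0 else -1
```

The reweighting step is what makes boosting useful for the annotation-error detection mentioned above : examples that keep resisting classification accumulate large weights, and inspecting them often surfaces mislabeled data .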
Finally , we offer an evaluation of the utility of the grammar coding system for use by automatic natural language parsing systems . ##H92-1088 Towards Using Prosody In Speech Recognition\/Understanding Systems : Differences Between Read And Spontaneous Speech . A persistent problem for keyword-driven speech recognition systems is that users often embed the to-be-recognized words or phrases in longer utterances . The recognizer needs to locate the relevant sections of the speech signal and ignore extraneous words . Prosody might provide an extra source of information to help locate target words embedded in other speech . In this paper we examine some prosodic characteristics of 160 such utterances and compare matched read and spontaneous versions . Half of the utterances are from a corpus of spontaneous answers to requests for the name of a city , recorded from calls to Directory Assistance Operators . The other half are the same word strings read by volunteers attempting to model the real dialogue . Results show a consistent pattern across both sets of data : embedded city names almost always bear nuclear pitch accents and are in their own intonational phrases . However the distributions of tonal make-up of these prosodic features differ markedly in read versus spontaneous speech , implying that if algorithms that exploit these prosodic regularities are trained on read speech , then the probabilities are likely to be incorrect models of real user speech . ##W04-3213 Unsupervised Semantic Role Labeling . We present an unsupervised method for labeling the arguments of verbs with their semantic roles . Our bootstrapping algorithm makes initial unambiguous role assignments , and then iteratively updates the probability model on which future assignments are based . A novel aspect of our approach is the use of verb , slot , and noun class information as the basis for backing off in our probability model . 
We achieve 50 -- 65 % reduction in the error rate over an informed baseline , indicating the potential of our approach for a task that has heretofore relied on large amounts of manually generated training data . ##W03-1603 Preferential Presentation Of Japanese Near-Synonyms Using Definition Statements . This paper proposes a new method of ranking near-synonyms ordered by their suitability of nuances in a particular context . Our method distinguishes near-synonyms by semantic features extracted from their definition statements in an ordinary dictionary , and ranks them by the types of features and a particular context . Our method is an initial step to achieve a semantic paraphrase system for authoring support . ##W06-1007 Structural Properties Of Lexical Systems : Monolingual And Multilingual Perspectives . We introduce a new type of lexical structure called lexical system , an interoperable model that can feed both monolingual and multilingual language resources . We begin with a formal characterization of lexical systems as `` pure '' directed graphs , solely made up of nodes corresponding to lexical entities and links . To illustrate our approach , we present data borrowed from a lexical system that has been generated from the French DiCo database . We later explain how the compilation of the original dictionary-like database into a net-like one has been made possible . Finally , we discuss the potential of the proposed lexical structure for designing multilingual lexical resources . ##P07-2055 A Hybrid Approach to Word Segmentation and POS Tagging . In this paper , we present a hybrid method for word segmentation and POS tagging . The target languages are those in which word boundaries are ambiguous , such as Chinese and Japanese . In the method , word-based and character-based processing is combined , and word segmentation and POS tagging are conducted simultaneously . Experimental results on multiple corpora show that the integrated method has high accuracy . 
##H94-1120 Natural Language Dialogue For Interactive Planning . ##W06-3325 The Difficulties Of Taxonomic Name Extraction And A Solution . In modern biology , digitization of biosystematics publications is an important task . Extraction of taxonomic names from such documents is one of its major issues . This is because these names identify the various genera and species . This article reports on our experiences with learning techniques for this particular task . We say why established Named-Entity Recognition techniques are somewhat difficult to use in our context . One reason is that we have only very little training data available . Our experiments show that a combining approach that relies on regular expressions , heuristics , and word-level language recognition achieves very high precision and recall and allows us to cope with those difficulties . ##N09-2065 Recognising the Predicate-argument Structure of Tagalog . This paper describes research on parsing Tagalog text for predicate -- argument structure -LRB- PAS -RRB- . We first outline the linguistic phenomenon and corpus annotation process , then detail a series of PAS parsing experiments . ##W09-0435 A POS-Based Model for Long-Range Reorderings in SMT . In this paper we describe a new approach to model long-range word reorderings in statistical machine translation -LRB- SMT -RRB- . Until now , most SMT approaches are only able to model local reorderings . But even the word order of related languages like German and English can be very different . In recent years approaches that reorder the source sentence in a preprocessing step to better match target sentences according to POS -LRB- Part-of-Speech -RRB- - based rules have been applied successfully . We enhance this approach to model long-range reorderings by introducing discontinuous rules . 
We tested this new approach on a German-English translation task and could significantly improve the translation quality , by up to 0.8 BLEU points , compared to a system which already uses continuous POS-based rules to model short-range reorderings . ##N09-1066 Using Citations to Generate Surveys of Scientific Paradigms . The number of research publications in various disciplines is growing exponentially . Researchers and scientists are increasingly finding themselves in the position of having to quickly understand large amounts of technical material . In this paper we present the first steps in producing an automatically generated , readily consumable , technical survey . Specifically we explore the combination of citation information and summarization techniques . Even though prior work -LRB- Teufel et al. , 2006 -RRB- argues that citation text is unsuitable for summarization , we show that in the framework of multi-document survey creation , citation texts can play a crucial role . ##H05-1069 Word Sense Disambiguation Using Sense Examples Automatically Acquired From A Second Language . We present a novel almost-unsupervised approach to the task of Word Sense Disambiguation -LRB- WSD -RRB- . We build sense examples automatically , using large quantities of Chinese text , and English-Chinese and Chinese-English bilingual dictionaries , taking advantage of the observation that mappings between words and meanings are often different in typologically distant languages . We train a classifier on the sense examples and test it on a gold standard English WSD dataset . The evaluation gives results that exceed previous state-of-the-art results for comparable systems . We also demonstrate that a little manual effort can improve the quality of sense examples , as measured by WSD accuracy . The performance of the classifier on WSD also improves as the number of training sense examples increases . ##C00-1029 A Class-Based Probabilistic Approach To Structural Disambiguation . 
Knowledge of which words are able to fill particular argument slots of a predicate can be used for structural disambiguation . This paper describes a proposal for acquiring such knowledge , and in line with much of the recent work in this area , a probabilistic approach is taken . We develop a novel way of using a semantic hierarchy to estimate the probabilities , and demonstrate the general approach using a prepositional phrase attachment experiment . ##C08-1054 Coordination Disambiguation without Any Similarities . The use of similarities has been one of the main approaches to resolve the ambiguities of coordinate structures . In this paper , we present an alternative method for coordination disambiguation , which does not use similarities . Our hypothesis is that coordinate structures are supported by surrounding dependency relations , and that such dependency relations rather yield similarity between conjuncts , which humans feel . Based on this hypothesis , we built a Japanese fully-lexicalized generative parser that includes coordination disambiguation . Experimental results on web sentences indicated the effectiveness of our approach , and endorsed our hypothesis . ##N09-2037 Evaluating the Syntactic Transformations in Gold Standard Corpora for Statistical Sentence Compression . We present a policy-based error analysis approach that demonstrates a limitation to the current commonly adopted paradigm for sentence compression . We demonstrate that these limitations arise from the strong assumption of locality of the decision making process in the search for an acceptable derivation in this paradigm . ##W08-1109 The Use of Spatial Relations in Referring Expression Generation . 
There is a prevailing assumption in the literature on referring expression generation that relations are used in descriptions only ` as a last resort ' , typically on the basis that including the second entity in the relation introduces an additional cognitive load for either speaker or hearer . In this paper , we describe an experiment that attempts to test this assumption ; we determine that , even in simple scenes where the use of relations is not strictly required in order to identify an entity , relations are in fact often used . We draw some conclusions as to what this means for the development of algorithms for the generation of referring expressions . ##P08-1072 Robust Dialog Management with N-Best Hypotheses Using Dialog Examples and Agenda . This work presents an agenda-based approach to improve the robustness of the dialog manager by using dialog examples and n-best recognition hypotheses . This approach supports n-best hypotheses in the dialog manager and keeps track of the dialog state using a discourse interpretation algorithm with the agenda graph and focus stack . Given the agenda graph and n-best hypotheses , the system can predict the next system actions to maximize multi-level score functions . To evaluate the proposed method , a spoken dialog system for a building guidance robot was developed . Preliminary evaluation shows this approach would be effective to improve the robustness of example-based dialog modeling . ##P91-1025 Resolving Translation Mismatches With Information Flow . Languages differ in the concepts and real-world entities for which they have words and grammatical constructs . Therefore translation must sometimes be a matter of approximating the meaning of a source language text rather than finding an exact counterpart in the target language . We propose a translation framework based on Situation Theory . 
The basic ingredients are an information lattice , a representation scheme for utterances embedded in contexts , and a mismatch resolution scheme defined in terms of information flow . We motivate our approach with examples of translation between English and Japanese . ##C04-1093 Summarizing Encyclopedic Term Descriptions On The Web . We are developing an automatic method to compile an encyclopedic corpus from the Web . In our previous work , paragraph-style descriptions for a term are extracted from Web pages and organized based on domains . However , these descriptions are independent and do not comprise a condensed text as in hand-crafted encyclopedias . To resolve this problem , we propose a summarization method , which produces a single text from multiple descriptions . The resultant summary concisely describes a term from different viewpoints . We also show the effectiveness of our method by means of experiments . ##P96-1021 A Polynomial-Time Algorithm For Statistical Machine Translation . We introduce a polynomial-time algorithm for statistical machine translation . This algorithm can be used in place of the expensive , slow best-first search strategies in current statistical translation architectures . The approach employs the stochastic bracketing transduction grammar -LRB- SBTG -RRB- model we recently introduced to replace earlier word alignment channel models , while retaining a bigram language model . The new algorithm in our experience yields major speed improvement with no significant loss of accuracy . ##W01-0704 Semantic Pattern Learning Through Maximum Entropy-Based WSD Technique . This paper describes a Natural Language Learning method that extracts knowledge in the form of semantic patterns with ontology elements associated to syntactic components in the text . 
The method combines the use of EuroWordNet 's ontological concepts and the correct sense of each word assigned by a Word Sense Disambiguation -LRB- WSD -RRB- module to extract three sets of patterns : subject-verb , verb-direct object and verb-indirect object . These sets define the semantic behavior of the main textual elements based on their syntactic role . On the one hand , it is shown that Maximum Entropy models applied to WSD tasks provide good results . The evaluation of the WSD module has revealed an accuracy rate of 64 % in a preliminary test . On the other hand , we explain how an adequate set of semantic or ontological patterns can improve the success rate of NLP tasks such as pronoun resolution . We have implemented both modules in C++ and although the evaluation has been performed for English , their general features allow the treatment of other languages like Spanish . This paper has been partially supported by the Spanish Government -LRB- CICYT -RRB- project number TIC2000-0664-C0202 . ##C92-2089 A Feature-Based Model For Lexical Databases . To date , no fully suitable data model for lexical databases has been proposed . As lexical databases have proliferated in multiple formats , there has been growing concern over the reusability of lexical resources . In this paper , we propose a model based on feature structures which overcomes most of the problems inherent in classical database models , and in particular enables accessing , manipulating or merging information structured in multiple ways . Because of their widespread use in the representation of linguistic information , the applicability of feature structures to lexical databases seems natural , although to our knowledge this has not yet been implemented . The use of feature structures in lexical databases also opens up the possibility of compatibility with computational lexicons . 
##W06-1640 Partially Supervised Coreference Resolution For Opinion Summarization Through Structured Rule Learning . Combining fine-grained opinion information to produce opinion summaries is important for sentiment analysis applications . Toward that end , we tackle the problem of source coreference resolution -- linking together source mentions that refer to the same entity . The partially supervised nature of the problem leads us to define and approach it as the novel problem of partially supervised clustering . We propose and evaluate a new algorithm for the task of source coreference resolution that outperforms competitive baselines . ##D07-1118 Building Domain-Specific Taggers without Annotated -LRB- Domain -RRB- Data . Part of speech tagging is a fundamental component in many NLP systems . When taggers developed in one domain are used in another domain , the performance can degrade considerably . We present a method for developing taggers for new domains without requiring POS annotated text in the new domain . Our method involves using raw domain text and identifying related words to form a domain specific lexicon . This lexicon provides the initial lexical probabilities for EM training of an HMM model . We evaluate the method by applying it in the Biology domain and show that we achieve results that are comparable with some taggers developed for this domain . ##E09-2003 Grammatical Framework Web Service . We present a web service for natural language parsing , prediction , generation , and translation using grammars in Portable Grammar Format -LRB- PGF -RRB- , the target format of the Grammatical Framework -LRB- GF -RRB- grammar compiler . The web service implementation is open source , works with any PGF grammar , and with any web server that supports FastCGI . The service exposes a simple interface which makes it possible to use it for interactive natural language web applications . 
We describe the functionality and interface of the web service , and demonstrate several applications built on top of it . ##W91-0114 Shared Preferences . This paper attempts to develop a theory of heuristics or preferences that can be shared between understanding and generation systems . We first develop a formal analysis of preferences and consider the relation between their uses in generation and understanding . We then present a bidirectional algorithm for applying them and examine typical heuristics for lexical choice , scope and anaphora in more detail . ##C88-1003 Functional Constraints In Knowledge-Based Natural Language Understanding . Many knowledge-based systems of semantic interpretation rely explicitly or implicitly on an assumption of structural isomorphy between syntactic and semantic objects , handling exceptions by ad hoc measures . In this paper I argue that constraint equations of the kind used in the LFG -LRB- or PATR -RRB- formalisms provide a more general , and yet restricted , formalism in which not only isomorphic correspondences are expressible , but also many cases of non-isomorphic correspondences . I illustrate with treatments of idioms , speech act interpretation and discourse pragmatics . ##W04-1312 Modelling Atypical Syntax Processing . We evaluate the inferences that can be drawn from dissociations in syntax processing identified in developmental disorders and acquired language deficits . We use an SRN to simulate empirical data from Dick et al. -LRB- 2001 -RRB- on the relative difficulty of comprehending different syntactic constructions under normal conditions and conditions of damage . We conclude that task constraints and internal computational constraints interact to predict patterns of difficulty . Difficulty is predicted by frequency of constructions , by the requirement of the task to focus on local vs. global sequence information , and by the ability of the system to maintain sequence information . 
We generate a testable prediction on the empirical pattern that should be observed under conditions of developmental damage . ##A88-1008 Handling Scope Ambiguities In English . This paper describes a program for handling `` scope ambiguities '' in individual English sentences . The program operates on initial logical translations , generated by a parser\/translator , in which `` unscoped elements '' such as quantifiers , coordinators and negation are left in place to be extracted and positioned by the scoping program . The program produces the set of valid scoped readings , omitting logically redundant readings , and places the readings in an approximate order of preference using a set of domain-independent heuristics . The heuristics are based on information about the lexical type of each operator and on `` structural relations '' between pairs of operators . The need for such domain-independent heuristics is emphasized ; in some cases they can be decisive and in general they will serve as a guide to the use of further heuristics based on domain-specific knowledge and on the context of discourse . The emphasis of this paper is on discussing several of the more problematic aspects of the scoping protocol which were encountered during the design of the scoping program . ##W00-1402 A Task-Based Framework To Evaluate Evaluative Arguments . We present an evaluation framework in which the effectiveness of evaluative arguments can be measured with real users . The framework is based on the task-efficacy evaluation method . An evaluative argument is presented in the context of a decision task and measures related to its effectiveness are assessed . Within this framework , we are currently running a formal experiment to verify whether argument effectiveness can be increased by tailoring the argument to the user and by varying the degree of argument conciseness . ##W09-1408 A memory-based learning approach to event extraction in biomedical texts . 
In this paper we describe the memory-based machine learning system that we submitted to the BioNLP Shared Task on Event Extraction . We modeled the event extraction task using an approach that has been previously applied to other natural language processing tasks like semantic role labeling or negation scope finding . The results obtained by our system -LRB- 30.58 F-score in Task 1 and 29.27 in Task 2 -RRB- suggest that the approach and the system need further adaptation to the complexity involved in extracting biomedical events . ##C04-1118 Controlling Gender Equality With Shallow NLP Techniques . This paper introduces the `` Gendercheck Editor '' , a tool to check German texts for gender discriminatory formulations . It relies on shallow rule-based techniques as used in the Controlled Language Authoring Technology -LRB- CLAT -RRB- . The paper outlines major sources of gender imbalances in German texts . It gives a background on the underlying CLAT technology and describes the marking and annotation strategy to automatically detect and visualize the questionable pieces of text . The paper provides a detailed evaluation of the editor . ##P91-1039 Factorization Of Language Constraints In Speech Recognition . Integration of language constraints into a large vocabulary speech recognition system often leads to prohibitive complexity . We propose to factor the constraints into two components . The first is characterized by a covering grammar which is small and easily integrated into existing speech recognizers . The recognized string is then decoded by means of an efficient language post-processor in which the full set of constraints is imposed to correct possible errors introduced by the speech recognizer . ##I05-2011 Automatic Detection of Opinion Bearing Words and Sentences . We describe a sentence-level opinion detection system . 
We first define what an opinion means in our research and introduce an effective method for obtaining opinion-bearing and nonopinion-bearing words . Then we describe recognizing opinion-bearing sentences using these words . We test the system on 3 different test sets : MPQA data , an internal corpus , and the TREC2003 Novelty track data . We show that our automatic method for obtaining opinion-bearing words can be used effectively to identify opinion-bearing sentences . ##H92-1085 Automatic Detection And Correction Of Repairs In Human-Computer Dialog . We have analyzed 607 sentences of spontaneous human-computer speech data containing repairs -LRB- drawn from a corpus of 10,718 -RRB- . We present here criteria and techniques for automatically detecting the presence of a repair , its location , and making the appropriate correction . The criteria involve integration of knowledge from several sources : pattern matching , syntactic and semantic analysis , and acoustics . ##J95-4003 Modularity And Information Content Classes In Principle-Based Parsing . In recent years models of parsing that are isomorphic to a principle-based theory of grammar -LRB- most notably Government and Binding -LRB- GB -RRB- Theory -RRB- have been proposed -LRB- Berwick et al. 1991 -RRB- . These models are natural and direct implementations of the grammar , but they are not efficient , because GB is not a computationally modular theory . This paper investigates one problem related to the tension between building linguistically based parsers and building efficient ones . In particular , the issue of what is a linguistically motivated way of deriving a parser from principle-based theories of grammar is explored . It is argued that an efficient and faithful parser can be built by taking advantage of the way in which principles are stated . To support this claim , two features of an implemented parser are discussed . 
First , configurations and lexical information are precompiled separately into two tables -LRB- an X table and a table of lexical co-occurrence -RRB- which gives rise to more compact data structures . Secondly , precomputation of syntactic features -LRB- O-roles , case , etc. -RRB- results in efficient computation of chains , because it reduces several problems of chain formation to a local computation , thus avoiding extensive search of the tree for an antecedent or extensive backtracking . It is also shown that this method of building long-distance dependencies can be computed incrementally . ##N09-2017 Evaluation of a System for Noun Concepts Acquisition from Utterances about Images -LRB- SINCA -RRB- Using Daily Conversation Data . For a robot working in an open environment , a task-oriented language capability will not be sufficient . In order to adapt to the environment , such a robot will have to learn language dynamically . We developed a System for Noun Concepts Acquisition from utterances about Images , SINCA in short . It is a language acquisition system without knowledge of grammar and vocabulary , which learns noun concepts from user utterances . We recorded a video of a child 's daily life to collect dialogue data that was spoken to and around him . The child is a member of a family consisting of the parents and his sister . We evaluated the performance of SINCA using the collected data . In this paper , we describe the algorithms of SINCA and an evaluation experiment . We work on Japanese language acquisition , however our method can easily be adapted to other languages . ##H01-1011 Automatic Title Generation For Spoken Broadcast News . In this paper , we implemented a set of title generation methods using training set of 21190 news stories and evaluated them on an independent test corpus of 1006 broadcast news documents , comparing the results over manual transcription to the results over automatically recognized speech . 
We use both F1 and the average number of correct title words in the correct order as metrics . Overall , the results show that title generation for speech recognized news documents is possible at a level approaching the accuracy of titles generated for perfect text transcriptions . ##W97-0212 Sense Tagging In Action Combining Different Tests With Additive Weightings . This paper describes a working sense tagger , which attempts to automatically link each word in a text corpus to its corresponding sense in a machine-readable dictionary . It uses information automatically extracted from the MRD to find matches between the dictionary and the corpus sentences , and combines different types of information by simple additive scores with manually set weightings . ##J88-3003 Modeling The User 's Plans And Goals . This work is an ongoing research effort aimed both at developing techniques for inferring and constructing a user model from an information-seeking dialog and at identifying strategies for applying this model to enhance robust communication . One of the most important components of a user model is a representation of the system 's beliefs about the underlying task-related plan motivating an information-seeker 's queries . These beliefs can be used to interpret subsequent utterances and produce useful responses . This paper describes the IREPS system , emphasizing its dynamic construction of the task-related plan motivating the information-seeker 's queries and the application of this component of a user model to handling utterances that violate the pragmatic rules of the system 's world model . By reasoning on a model of the user 's plans and goals , the system often can deduce the intended meaning of faulty utterances and allow the dialogue to continue without interruption . Some limitations of current plan inference systems are discussed . 
It is suggested that the problem of detecting and recovering from discrepancies between the system 's model of the user 's plan and the actual plan under construction by the user requires an enriched model that differentiates among its components on the basis of the support the system accords each component as a correct and intended part of the user 's plan . ##E87-1012 A Tool For The Automatic Creation , Extension And Updating Of Lexical Knowledge Bases . A tool is described which helps in the creation , extension and updating of lexical knowledge bases -LRB- LKBs -RRB- . Two levels of representation are distinguished : a static storage level and a dynamic knowledge level . The latter is an object-oriented environment containing linguistic and lexicographic knowledge . At the knowledge level , constructors and filters can be defined . Constructors are objects which extend the LKB both horizontally -LRB- new information -RRB- and vertically -LRB- new entries -RRB- using the linguistic knowledge . Filters are objects which derive new LKBs from existing ones thereby optionally changing the storage structure . The latter use lexicographic knowledge . ##W98-1123 Linear Segmentation And Segment Significance . We present a new method for discovering a segmental discourse structure of a document while categorizing each segment 's function and importance . Segments are determined by a zero-sum weighting scheme , used on occurrences of noun phrases and pronominal forms retrieved from the document . Segment roles are then calculated from the distribution of the terms in the segment . Finally , we present results of evaluation in terms of precision and recall which surpass earlier approaches ' . ##P97-1067 Choosing The Word Most Typical In Context Using A Lexical Co-Occurrence Network . This paper presents a partial solution to a component of the problem of lexical choice : choosing the synonym most typical , or expected , in context . 
We apply a new statistical approach to representing the context of a word through lexical co-occurrence networks . The implementation was trained and evaluated on a large corpus , and results show that the inclusion of second-order co-occurrence relations improves the performance of our implemented lexical choice program . ##P08-1078 Contextual Preferences . The validity of semantic inferences depends on the contexts in which they are applied . We propose a generic framework for handling contextual considerations within applied inference , termed Contextual Preferences . This framework defines the various context-aware components needed for inference and their relationships . Contextual preferences extend and generalize previous notions , such as selectional preferences , while experiments show that the extended framework allows improving inference quality on real application data . ##W04-1108 Combining Neural Networks And Statistics For Chinese Word Sense Disambiguation . The input of the network is the key problem for Chinese word sense disambiguation utilizing the Neural Network . This paper presents an input model of Neural Network that calculates the Mutual Information between contextual words and the ambiguous word by using a statistical method and taking the contextual words up to a certain number beside the ambiguous word according to -LRB- - M , + N -RRB- . The experiment adopts a triple-layer BP Neural Network model and proves how the size of the training set and the value of M and N affect the performance of the Neural Network model . The experimental objects are six pseudowords owning three word-senses constructed according to certain principles . Tested accuracy of our approach on a closed-corpus reaches 90.31 % , and 89.62 % on an open-corpus . The experiment proves that the Neural Network model has good performance on word sense disambiguation . ##D08-1112 An Analysis of Active Learning Strategies for Sequence Labeling Tasks . 
Active learning is well-suited to many problems in natural language processing , where unlabeled data may be abundant but annotation is slow and expensive . This paper aims to shed light on the best active learning approaches for sequence labeling tasks such as information extraction and document segmentation . We survey previously used query selection strategies for sequence models , and propose several novel algorithms to address their shortcomings . We also conduct a large-scale empirical comparison using multiple corpora , which demonstrates that our proposed methods advance the state of the art . ##H05-1002 Data-Driven Approaches For Information Structure Identification . This paper investigates automatic identification of Information Structure -LRB- IS -RRB- in texts . The experiments use the Prague Dependency Treebank which is annotated with IS following the Praguian approach of Topic Focus Articulation . We automatically detect t -LRB- opic -RRB- and f -LRB- ocus -RRB- , using node attributes from the treebank as basic features and derived features inspired by the annotation guidelines . We present the performance of decision trees -LRB- C4.5 -RRB- , maximum entropy , and rule induction -LRB- RIPPER -RRB- classifiers on all tectogrammatical nodes . We compare the results against a baseline system that always assigns f -LRB- ocus -RRB- and against a rule-based system . The best system achieves an accuracy of 90.69 % , which is a 44.73 % improvement over the baseline -LRB- 62.66 % -RRB- . ##W06-2502 Cluster Stopping Rules For Word Sense Discrimination . As text data becomes plentiful , unsupervised methods for Word Sense Disambiguation -LRB- WSD -RRB- become more viable . A problem encountered in applying WSD methods is finding the exact number of senses an ambiguity has in a training corpus collected in an automated manner . That number is not known a priori ; rather it needs to be determined based on the data itself . 
We address that problem using cluster stopping methods . Such techniques have not previously been applied to WSD . We implement the methods of Calinski and Harabasz -LRB- 1975 -RRB- and Hartigan -LRB- 1975 -RRB- and our adaptation of the Gap statistic -LRB- Tibshirani , Walter and Hastie , 2001 -RRB- . For evaluation , we use the WSD Test Set from the National Library of Medicine , whose sense inventory is the Unified Medical Language System . The best accuracy for selecting the correct number of clusters is 0.60 with the C&H method . Our error analysis shows that the cluster stopping methods make finer-grained sense distinctions by creating additional clusters . The highest F-scores -LRB- 82.89 -RRB- , indicative of the quality of cluster membership assignment , are comparable to the baseline majority sense -LRB- 82.63 -RRB- and point to a path towards accuracy improvement via additional cluster pruning . The importance and significance of the current work is in applying cluster stopping rules to WSD . ##W91-0102 Reversibility In A Constraint And Type Based Logic Grammar : Application To Secondary Predication . In this document , we present a formalism for natural language processing which associates type construction principles to constraint logic programming . We show that it provides more uniform , expressive and efficient tools for parsing and generating language . Next , we present two abstract machines which enable us to design , in a symmetric way , a parser and a generator from that formalism . This abstract machinery is then exemplified by a detailed study of secondary predication within the framework of a principle-based description of language : Government and Binding theory . ##C08-1023 Pedagogically Useful Extractive Summaries for Science Education . This paper describes the design and evaluation of an extractive summarizer for educational science content called COGENT . 
COGENT extends MEAD based on strategies elicited from an empirical study with science domain and instructional design experts . COGENT identifies sentences containing pedagogically relevant concepts for a specific science domain . The algorithms pursue a hybrid approach integrating both domain independent bottom-up sentence scoring features and domain-aware top-down features . Evaluation results indicate that COGENT outperforms existing summarizers and generates summaries that closely resemble those generated by human experts . COGENT concept inventories appear to also support the computational identification of student misconceptions about earthquakes and plate tectonics . ##D08-1090 Language and Translation Model Adaptation using Comparable Corpora . Traditionally , statistical machine translation systems have relied on parallel bi-lingual data to train a translation model . While bi-lingual parallel data are expensive to generate , monolingual data are relatively common . Yet monolingual data have been under-utilized , having been used primarily for training a language model in the target language . This paper describes a novel method for utilizing monolingual target data to improve the performance of a statistical machine translation system on news stories . The method exploits the existence of comparable text -- multiple texts in the target language that discuss the same or similar stories as found in the source language document . For every source document that is to be translated , a large monolingual data set in the target language is searched for documents that might be comparable to the source documents . These documents are then used to adapt the MT system to increase the probability of generating texts that resemble the comparable document . Experimental results obtained by adapting both the language and translation models show substantial gains over the baseline system . ##N06-2039 Unsupervised Induction Of Modern Standard Arabic Verb Classes . 
We exploit the resources in the Arabic Treebank -LRB- ATB -RRB- for the novel task of automatically creating lexical semantic verb classes for Modern Standard Arabic -LRB- MSA -RRB- . Verbs are clustered into groups that share semantic elements of meaning as they exhibit similar syntactic behavior . The results of the clustering experiments are compared with a gold standard set of classes , which is approximated by using the noisy English translations provided in the ATB to create Levin-like classes for MSA . The quality of the clusters is found to be sensitive to the inclusion of information about lexical heads of the constituents in the syntactic frames , as well as parameters of the clustering algorithm . The best set of parameters yields an F β = 1 score of 0.501 , compared to a random baseline with an F β = 1 score of 0.37 . ##W07-1416 Textual Entailment Using Univariate Density Model and Maximizing Discriminant Function . The primary focus of this entry this year was , firstly , to develop a framework to allow multiple researchers from our group to easily contribute metrics measuring textual entailment , and secondly , to provide a baseline which we could use in our tools to evaluate and compare new metrics . A development environment tool was created to quickly allow for testing of various metrics and to easily randomize the development and test sets . For each test , this RTE tool calculated two sets of results by applying the metrics to both a univariate Gaussian density and by maximizing a linear discriminant function . The metrics used for the submission were a lexical similarity metric and a lexical similarity metric using synonym and antonym replacement . The two submissions for RTE 2007 scored an accuracy of 61.00 % and 62.62 % . ##H93-1066 A Speech-First Model For Repair Detection And Correction . Interpreting fully natural speech is an important goal for spoken language understanding systems . 
However , while corpus studies have shown that about 10 % of spontaneous utterances contain self-corrections , or REPAIRS , little is known about the extent to which cues in the speech signal may facilitate repair processing . We identify several cues based on acoustic and prosodic analysis of repairs in the DARPA Air Travel Information System database , and propose methods for exploiting these cues to detect and correct repairs . ##C08-5001 Advanced Dynamic Programming in Semiring and Hypergraph Frameworks . Dynamic Programming -LRB- DP -RRB- is an important class of algorithms widely used in many areas of speech and language processing . Recently there have been a series of works trying to formalize many instances of DP algorithms under algebraic and graph-theoretic frameworks . This tutorial surveys two such frameworks , namely semirings and directed hypergraphs , and draws connections between them . We formalize two particular types of DP algorithms under each of these frameworks : the Viterbi-style topological algorithms and the Dijkstra-style best-first algorithms . Wherever relevant , we also discuss typical applications of these algorithms in Natural Language Processing . ##W05-0836 Training And Evaluating Error Minimization Decision Rules For Statistical Machine Translation . Decision rules that explicitly account for non-probabilistic evaluation metrics in machine translation typically require special training , often to estimate parameters in exponential models that govern the search space and the selection of candidate translations . While the traditional Maximum A Posteriori -LRB- MAP -RRB- decision rule can be optimized as a piecewise linear function in a greedy search of the parameter space , the Minimum Bayes Risk -LRB- MBR -RRB- decision rule is not well suited to this technique , a condition that makes past results difficult to compare . 
We present a novel training approach for non-tractable decision rules , allowing us to compare and evaluate these and other decision rules on a large scale translation task , taking advantage of the high dimensional parameter space available to the phrase based Pharaoh decoder . This comparison is timely , and important , as decoders evolve to represent more complex search space decisions and are evaluated against innovative evaluation metrics of translation quality . ##C94-1084 Towards Automatic Extraction Of Monolingual And Bilingual Terminology . In this paper , we make use of linguistic knowledge to identify certain noun phrases , both in English and French , which are likely to be terms . We then test and compare different statistical scores to select the `` good '' ones among the candidate terms , and finally propose a statistical method to build correspondences of multi-word units across languages . Acknowledgement Most of this work was carried out under project ET-10 \/ 63 , co-sponsored by the European Economic Community . ##N04-1008 Automatic Question Answering : Beyond The Factoid . In this paper we describe and evaluate a Question Answering system that goes beyond answering factoid questions . We focus on FAQ-like questions and answers , and build our system around a noisy-channel architecture which exploits both a language model for answers and a transformation model for answer\/question terms , trained on a corpus of 1 million question\/answer pairs collected from the Web . ##W97-0613 The `` Casual Cashmere Diaper Bag '' : Constraining Speech Recognition Using Examples . We describe a new technology for using small collections of example sentences to automatically restrict a speech recognition grammar to allow only the more plausible subset of the sentences it would otherwise admit . 
This technology is unusual because it bridges the gap between hand-built grammars -LRB- used with no training data -RRB- and statistical approaches -LRB- which require significant data -RRB- . ##P06-1022 Dependency Parsing Of Japanese Spoken Monologue Based On Clause Boundaries . Spoken monologues feature greater sentence length and structural complexity than do spoken dialogues . To achieve high parsing performance for spoken monologues , it could prove effective to simplify the structure by dividing a sentence into suitable language units . This paper proposes a method for dependency parsing of Japanese monologues based on sentence segmentation . In this method , the dependency parsing is executed in two stages : at the clause level and the sentence level . First , the dependencies within a clause are identified by dividing a sentence into clauses and executing stochastic dependency parsing for each clause . Next , the dependencies over clause boundaries are identified stochastically , and the dependency structure of the entire sentence is thus completed . An experiment using a spoken monologue corpus shows this method to be effective for efficient dependency parsing of Japanese monologue sentences . ##P08-3012 A Hierarchical Approach to Encoding Medical Concepts for Clinical Notes . This paper proposes a hierarchical text categorization -LRB- TC -RRB- approach to encoding free-text clinical notes with ICD-9-CM codes . Preliminary experimental result on the 2007 Computational Medicine Challenge data shows a hierarchical TC system has achieved a microaveraged F1 value of 86.6 , which is comparable to the performance of state-of-the-art flat classification systems . ##E09-3005 Structural Correspondence Learning for Parse Disambiguation . The paper presents an application of Structural Correspondence Learning -LRB- SCL -RRB- -LRB- Blitzer et al. , 2006 -RRB- for domain adaptation of a stochastic attribute-value grammar -LRB- SAVG -RRB- . 
So far , SCL has been applied successfully in NLP for Part-of-Speech tagging and Sentiment Analysis -LRB- Blitzer et al. , 2006 ; Blitzer et al. , 2007 -RRB- . An attempt was made in the CoNLL 2007 shared task to apply SCL to non-projective dependency parsing -LRB- Shimizu and Nakagawa , 2007 -RRB- , however , without any clear conclusions . We report on our exploration of applying SCL to adapt a syntactic disambiguation model and show promising initial results on Wikipedia domains . ##C08-1056 Normalizing SMS : are Two Metaphors Better than One ? Electronic written texts used in computer-mediated interactions -LRB- e-mails , blogs , chats , etc -RRB- present major deviations from the norm of the language . This paper presents a comparative study of systems aiming at normalizing the orthography of French SMS messages : after discussing the linguistic peculiarities of these messages , and possible approaches to their automatic normalization , we present , evaluate and contrast two systems , one drawing inspiration from the Machine Translation task ; the other using techniques that are commonly used in automatic speech recognition devices . Combining both approaches , our best normalization system achieves about 11 % Word Error Rate on a test set of about 3000 unseen messages . ##C04-1131 Word Sense Disambiguation Criteria : A Systematic Study . This article describes the results of a systematic in-depth study of the criteria used for word sense disambiguation . Our study is based on 60 target words : 20 nouns , 20 adjectives and 20 verbs . Our results are not always in line with some practices in the field . For example , we show that omitting noncontent words decreases performance and that bigrams yield better results than unigrams . ##W03-0428 Named Entity Recognition With Character-Level Models . We discuss two named-entity recognition models which use characters and character n-grams either exclusively or as an important part of their data representation . 
The first model is a character-level HMM with minimal context information , and the second model is a maximum-entropy conditional Markov model with substantially richer context features . Our best model achieves an overall F1 of 86.07 % on the English test data -LRB- 92.31 % on the development data -RRB- . This number represents a 25 % error reduction over the same model without word-internal -LRB- substring -RRB- features . ##W06-3005 A Data Driven Approach To Relevancy Recognition For Contextual Question Answering . Contextual question answering -LRB- QA -RRB- , in which users ' information needs are satisfied through an interactive QA dialogue , has recently attracted more research attention . One challenge of engaging dialogue into QA systems is to determine whether a question is relevant to the previous interaction context . We refer to this task as relevancy recognition . In this paper we propose a data driven approach for the task of relevancy recognition and evaluate it on two data sets : the TREC data and the HandQA data . The results show that we achieve better performance than a previous rule-based algorithm . A detailed evaluation analysis is presented . ##I08-1031 Hypothesis Selection in Machine Transliteration : A Web Mining Approach . We propose a new method of selecting hypotheses for machine transliteration . We generate a set of Chinese , Japanese , and Korean transliteration hypotheses for a given English word . We then use the set of transliteration hypotheses as a guide to finding relevant Web pages and mining contextual information for the transliteration hypotheses from the Web page . Finally , we use the mined information for machine-learning algorithms including support vector machines and maximum entropy model designed to select the correct transliteration hypothesis . 
In our experiments , our proposed method based on Web mining consistently outperformed systems based on simple Web counts used in previous work , regardless of the language . ##C02-1009 A Robust Cross-Style Bilingual Sentences Alignment Model . Most current sentence alignment approaches adopt sentence length and cognate as the alignment features ; and they are mostly trained and tested in the documents with the same style . Since the length distribution , alignment-type distribution -LRB- used by length-based approaches -RRB- and cognate frequency vary significantly across texts with different styles , the length-based approaches fail to achieve similar performance when tested in corpora of different styles . The experiments show that the performance in F-measure could drop from 98.2 % to 85.6 % when a length-based approach is trained by a technical manual and then tested on a general magazine . Since a large percentage of content words in the source text would be translated into the corresponding translation duals to preserve the meaning in the target text , transfer lexicons are usually regarded as more reliable cues for aligning sentences when the alignment task is performed by human . To enhance the robustness , a robust statistical model based on both transfer lexicons and sentence lengths is proposed in this paper . After integrating the transfer lexicons into the model , a 60 % F-measure error reduction -LRB- from 14.4 % to 5.8 % -RRB- is observed . ##P94-1008 Common Topics And Coherent Situations : Interpreting Ellipsis In The Context Of Discourse Inference . It is claimed that a variety of facts concerning ellipsis , event reference , and interclausal coherence can be explained by two features of the linguistic form in question : -LRB- 1 -RRB- whether the form leaves behind an empty constituent in the syntax , and -LRB- 2 -RRB- whether the form is anaphoric in the semantics . 
It is proposed that these features interact with one of two types of discourse inference , namely Common Topic inference and Coherent Situation inference . The differing ways in which these types of inference utilize syntactic and semantic representations predicts phenomena for which it is otherwise difficult to account . ##W05-1605 Generating And Selecting Grammatical Paraphrases . Natural language has a high paraphrastic power yet not all paraphrases are appropriate for all contexts . In this paper , we present a TAG based surface realiser which supports both the generation and the selection of paraphrases . To deal with the combinatorial explosion typical of such an NP-complete task , we introduce a number of new optimisations in a tabular , bottom-up surface realisation algorithm . We then show that one of these optimisations supports paraphrase selection . ##P08-2052 FastSum : Fast and Accurate Query-based Multi-document Summarization . We present a fast query-based multi-document summarizer called FastSum based solely on word-frequency features of clusters , documents and topics . Summary sentences are ranked by a regression SVM . The summarizer does not use any expensive NLP techniques such as parsing , tagging of names or even part of speech information . Still , the achieved accuracy is comparable to the best systems presented in recent academic competitions -LRB- i.e. , Document Understanding Conference -LRB- DUC -RRB- -RRB- . Because of a detailed feature analysis using Least Angle Regression -LRB- LARS -RRB- , FastSum can rely on a minimal set of features leading to fast processing times : 1250 news documents in 60 seconds . ##P07-1019 Forest Rescoring : Faster Decoding with Integrated Language Models . Efficient decoding has been a fundamental problem in machine translation , especially with an integrated language model which is essential for achieving good translation quality . 
We develop faster approaches for this problem based on k-best parsing algorithms and demonstrate their effectiveness on both phrase-based and syntax-based MT systems . In both cases , our methods achieve significant speed improvements , often by more than a factor of ten , over the conventional beam-search method at the same levels of search error and translation accuracy . ##P98-2181 Building Accurate Semantic Taxonomies from Monolingual MRDs . This paper presents a method that combines a set of unsupervised algorithms in order to accurately build large taxonomies from any machine-readable dictionary -LRB- MRD -RRB- . Our aim is to profit from conventional MRDs , with no explicit semantic coding . We propose a system that 1 -RRB- performs fully automatic extraction of taxonomic links from MRD entries and 2 -RRB- ranks the extracted relations in a way that selective manual refinement is allowed . Tested accuracy can reach around 100 % depending on the degree of coverage selected , showing that taxonomy building is not limited to structured dictionaries such as LDOCE . ##C02-2004 A Linguistic Discovery Program That Verbalizes Its Discoveries . We describe a discovery program , called UNIVAUTO -LRB- UNIVersals AUthoring TOol -RRB- , whose domain of application is the study of language universals , a classic trend in contemporary linguistics . Accepting as input information about languages , presented in terms of feature-values , the discoveries of another human agent arising from the same data , as well as some additional data , the program discovers the universals in the data , compares them with the discoveries of the human agent and , if appropriate , generates a report in English on its discoveries . Running UNIVAUTO on the data from the seminal paper of Greenberg -LRB- 1966 -RRB- on word order universals , the system has produced several linguistically valuable texts , two of which are published in a refereed linguistic journal . 
##W98-1421 Towards Multilingual Protocol Generation For Spontaneous Speech Dialogues . This paper presents a novel multi-lingual progress protocol generation module . The module is used within the speech-to-speech translation system VERBMOBIL . The task of the protocol is to give the dialogue partners a brief description of the content of their dialogue . We utilize an abstract representation describing , for instance , thematic information and dialogue acts of the dialogue utterances . From this representation we generate simplified paraphrases of the individual turns of the dialogue which together make up the protocol . Instead of writing completely new software , the protocol generation component is almost exclusively composed of already existing modules in the system which are extended by planning and formatting routines for protocol formulations . We describe how the abstract information is extracted from user utterances in different languages and how the abstract thematic representation is used to generate a protocol in one specific language . Future directions are given . ##P06-1017 Relation Extraction Using Label Propagation Based Semi-Supervised Learning . Shortage of manually labeled data is an obstacle to supervised relation extraction methods . In this paper we investigate a graph based semi-supervised learning algorithm , a label propagation -LRB- LP -RRB- algorithm , for relation extraction . It represents labeled and unlabeled examples and their distances as the nodes and the weights of edges of a graph , and tries to obtain a labeling function to satisfy two constraints : 1 -RRB- it should be fixed on the labeled nodes , 2 -RRB- it should be smooth on the whole graph . Experiment results on the ACE corpus showed that this LP algorithm achieves better performance than SVM when only very few labeled examples are available , and it also performs better than bootstrapping for the relation extraction task . 
##W00-0712 Knowledge-Free Induction Of Morphology Using Latent Semantic Analysis . Morphology induction is a subproblem of important tasks like automatic learning of machine-readable dictionaries and grammar induction . Previous morphology induction approaches have relied solely on statistics of hypothesized stems and affixes to choose which affixes to consider legitimate . Relying on stem-and-affix statistics rather than semantic knowledge leads to a number of problems , such as the inappropriate use of valid affixes -LRB- `` ally '' stemming to `` all '' -RRB- . We introduce a semantic-based algorithm for learning morphology which only proposes affixes when the stem and stem-plus-affix are sufficiently similar semantically . We implement our approach using Latent Semantic Analysis and show that our semantics-only approach provides morphology induction results that rival a current state-of-the-art system . ##N09-1017 The Role of Implicit Argumentation in Nominal SRL . Nominals frequently surface without overtly expressed arguments . In order to measure the potential benefit of nominal SRL for downstream processes , such nominals must be accounted for . In this paper , we show that a state-of-the-art nominal SRL system with an overall argument F1 of 0.76 suffers a performance loss of more than 9 % when nominals with implicit arguments are included in the evaluation . We then develop a system that takes implicit argumentation into account , improving overall performance by nearly 5 % . Our results indicate that the degree of implicit argumentation varies widely across nominals , making automated detection of implicit argumentation an important step for nominal SRL . ##W94-0108 The Automatic Construction Of A Symbolic Parser Via Statistical Techniques . We report on the development of a robust parsing device which aims to provide a partial explanation for child language acquisition and help in the construction of better natural language processing systems . 
The backbone of the new approach is the synthesis of statistical and symbolic approaches to natural language . Motivation We report on the progress we have made towards developing a robust ` self-constructing ' parsing device that uses indirect negative evidence -LRB- Kapur , ##W97-0124 Analysis Of Unknown Lexical Items Using Morphological And Syntactic Information With The TIMIT Corpus . The importance of dealing with unknown words in Natural Language Processing -LRB- NLP -RRB- is growing as NLP systems are used in more and more applications . One aid in predicting the lexical class of words that do not appear in the lexicon -LRB- referred to as unknown words -RRB- is the use of syntactic parsing rules . The distinction between closed-class and open-class words together with morphological recognition appears to be pivotal in increasing the ability of the system to predict the lexical categories of unknown words . An experiment is performed to investigate the ability of a parser to parse unknown words using morphology and syntactic parsing rules without human intervention . This experiment shows that the performance of the parser is enhanced greatly when morphological recognition is used in conjunction with syntactic rules to parse sentences containing unknown words from the TIMIT corpus . ##A88-1006 From Water To Wine : Generating Natural Language Text From Today 's Applications Programs . In this paper we present a means of compensating for the semantic deficits of linguistically naive underlying application programs without compromising principled grammatical treatments in natural language generation . We present a method for building an interface from today 's underlying application programs to the linguistic realization component Mumble-86 . The goal of the paper is not to discuss how Mumble works , but to describe how one exploits its capabilities . We provide examples from current generation projects using Mumble as their linguistic component . 
##P07-1039 Bootstrapping Word Alignment via Word Packing . We introduce a simple method to pack words for statistical word alignment . Our goal is to simplify the task of automatic word alignment by packing several consecutive words together when we believe they correspond to a single word in the opposite language . This is done using the word aligner itself , i.e. by bootstrapping on its output . We evaluate the performance of our approach on a Chinese-to-English machine translation task , and report a 12.2 % relative increase in BLEU score over a state-of-the-art phrase-based SMT system . ##J99-4005 Decoding Complexity In Word-Replacement Translation Models . Statistical machine translation is a relatively new approach to the long-standing problem of translating human languages by computer . Current statistical techniques uncover translation rules from bilingual training texts and use those rules to translate new texts . The general architecture is the source-channel model : an English string is statistically generated -LRB- source -RRB- , then statistically transformed into French -LRB- channel -RRB- . In order to translate -LRB- or `` decode '' -RRB- a French string , we look for the most likely English source . We show that for the simplest form of statistical models , this problem is NP-complete , i.e. , probably exponential in the length of the observed sentence . We trace this complexity to factors not present in other decoding problems . ##W96-0412 An Evaluation Of Anaphor Generation In Chinese . In this paper , we present an evaluation of anaphors generated by a Chinese natural language generation system . In the evaluation work , the anaphors in five test texts generated by three test systems employing generation rules with different complexities were compared with the ones in the same texts created by twelve native speakers of Chinese . 
We took the average number of anaphors matching between the machine and human texts as a measure of the quality of anaphors generated by the test systems . The results suggest that the one we have chosen and which has the most complex rule is better than the other two . There are , however , real difficulties in establishing the significance of the results because of the degree of disagreement among the native speakers . ##C94-1082 LHIP : Extended DCGs For Configurable Robust Parsing . We present LHIP , a system for incremental grammar development using an extended DCG formalism . The system uses a robust island-based parsing method controlled by user-defined performance thresholds . Keywords : DCG , head , island parsing , robust parsing , Prolog ##P96-1014 Computing Optimal Descriptions For Optimality Theory Grammars With Context-Free Position Structures . This paper describes an algorithm for computing optimal structural descriptions for Optimality Theory grammars with context-free position structures . This algorithm extends Tesar 's dynamic programming approach -LRB- Tesar , 1994 -RRB- -LRB- Tesar , 1995 -RRB- to computing optimal structural descriptions from regular to context-free structures . The generalization to context-free structures creates several complications , all of which are overcome without compromising the core dynamic programming approach . The resulting algorithm has a time complexity cubic in the length of the input , and is applicable to grammars with universal constraints that exhibit context-free locality . ##W05-0801 Association-Based Bilingual Word Alignment . Bilingual word alignment forms the foundation of current work on statistical machine translation . Standard word-alignment methods involve the use of probabilistic generative models that are complex to implement and slow to train . 
In this paper we show that it is possible to approach the alignment accuracy of the standard models using algorithms that are much faster , and in some ways simpler , based on basic word-association statistics . ##W04-0851 Regularized Least-Squares Classification For Word Sense Disambiguation . The paper describes RLSC-LIN and RLSC-COMB systems which participated in the Senseval-3 English lexical sample task . These systems are based on Regularized Least-Squares Classification -LRB- RLSC -RRB- learning method . We describe the reasons of choosing this method , how we applied it to word sense disambiguation , what results we obtained on Senseval-1 , Senseval-2 and Senseval-3 data and discuss some possible improvements . ##P09-1053 Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition . We present a novel approach to deciding whether two sentences hold a paraphrase relationship . We employ a generative model that generates a paraphrase of a given sentence , and we use probabilistic inference to reason about whether two sentences share the paraphrase relationship . The model cleanly incorporates both syntax and lexical semantics using quasi-synchronous dependency grammars -LRB- Smith and Eisner , 2006 -RRB- . Furthermore , using a product of experts -LRB- Hinton , 2002 -RRB- , we combine the model with a complementary logistic regression model based on state-of-the-art lexical overlap features . We evaluate our models on the task of distinguishing true paraphrase pairs from false ones on a standard corpus , giving competitive state-of-the-art performance . ##P03-1018 Orthogonal Negation In Vector Spaces For Modelling Word-Meanings And Document Retrieval . Standard IR systems can process queries such as `` web NOT internet '' , enabling users who are interested in arachnids to avoid documents about computing . The documents retrieved for such a query should be irrelevant to the negated query term . 
Most systems implement this by reprocessing results after retrieval to remove documents containing the unwanted string of letters . This paper describes and evaluates a theoretically motivated method for removing unwanted meanings directly from the original query in vector models , with the same vector negation operator as used in quantum logic . Irrelevance in vector spaces is modelled using orthogonality , so query vectors are made orthogonal to the negated term or terms . As well as removing unwanted terms , this form of vector negation reduces the occurrence of synonyms and neighbors of the negated terms by as much as 76 % compared with standard Boolean methods . By altering the query vector itself , vector negation removes not only unwanted strings but unwanted meanings . ##W96-0103 Hierarchical Clustering Of Words And Application To NLP Tasks . This paper describes a data-driven method for hierarchical clustering of words and clustering of multiword compounds . A large vocabulary of English words -LRB- 70,000 words -RRB- is clustered bottom-up , with respect to corpora ranging in size from 5 million to 50 million words , using mutual information as an objective function . The resulting hierarchical clusters of words are then naturally transformed to a bit-string representation of -LRB- i.e. word bits for -RRB- all the words in the vocabulary . Evaluation of the word bits is carried out through the measurement of the error rate of the ATR Decision-Tree Part-Of-Speech Tagger . The same clustering technique is then applied to the classification of multiword compounds . In order to avoid the explosion of the number of compounds to be handled , compounds in a small subclass are bundled and treated as a single compound . Another merit of this approach is that we can avoid the data sparseness problem which is ubiquitous in corpus statistics . The quality of one of the obtained compound classes is examined and compared to a conventional approach . 
##P03-1065 An Expert Lexicon Approach To Identifying English Phrasal Verbs . Phrasal Verbs are an important feature of the English language . Properly identifying them provides the basis for an English parser to decode the related structures . Phrasal verbs have been a challenge to Natural Language Processing -LRB- NLP -RRB- because they sit at the borderline between lexicon and syntax . Traditional NLP frameworks that separate the lexicon module from the parser make it difficult to handle this problem properly . This paper presents a finite state approach that integrates a phrasal verb expert lexicon between shallow parsing and deep parsing to handle morpho-syntactic interaction . With precision\/recall combined performance benchmarked consistently at 95.8 % -97.5 % , the Phrasal Verb identification problem has basically been solved with the presented method . ##P90-1024 Zero Morphemes In Unification-Based Combinatory Categorial Grammar . In this paper , we report on our use of zero morphemes in Unification-Based Combinatory Categorial Grammar . After illustrating the benefits of this approach with several examples , we describe the algorithm for compiling zero morphemes into unary rules , which allows us to use zero morphemes more efficiently in natural language processing . Then , we discuss the question of equivalence of a grammar with these unary rules to the original grammar . Lastly , we compare our approach to zero morphemes with possible alternatives . ##W07-1417 The Role of Sentence Structure in Recognizing Textual Entailment . Recent research suggests that sentence structure can improve the accuracy of recognizing textual entailments and paraphrasing . Although background knowledge such as gazetteers , WordNet and custom built knowledge bases are also likely to improve performance , our goal in this paper is to characterize the syntactic features alone that aid in accurate entailment prediction . 
We describe candidate features , the role of machine learning , and two final decision rules . These rules resulted in an accuracy of 60.50 and 65.87 % and average precision of 58.97 and 60.96 % in RTE3 Test and suggest that sentence structure alone can improve entailment accuracy by 9.25 to 14.62 % over the baseline majority class . ##D07-1065 Recovery of Empty Nodes in Parse Structures . In this paper , we describe a new algorithm for recovering WH-trace empty nodes . Our approach combines a set of hand-written patterns together with a probabilistic model . Because the patterns heavily utilize regular expressions , the pertinent tree structures are covered using a limited number of patterns . The probabilistic model is essentially a probabilistic context-free grammar -LRB- PCFG -RRB- approach with the patterns acting as the terminals in production rules . We evaluate the algorithm 's performance on gold trees and parser output using three different metrics . Our method compares favorably with state-of-the-art algorithms that recover WH-traces . ##P00-1070 Importance Of Pronominal Anaphora Resolution In Question Answering Systems . The main aim of this paper is to analyze the effects of applying pronominal anaphora resolution to Question Answering -LRB- QA -RRB- systems . For this task a complete QA system has been implemented . System evaluation measures performance improvements obtained when information that is referenced anaphorically in documents is not ignored . ##P04-1080 Learning Word Sense With Feature Selection And Order Identification Capabilities . This paper presents an unsupervised word sense learning algorithm , which induces senses of target word by grouping its occurrences into a `` natural '' number of clusters based on the similarity of their contexts . For removing noisy words in feature set , feature selection is conducted by optimizing a cluster validation criterion subject to some constraint in an unsupervised manner . 
Gaussian mixture model and Minimum Description Length criterion are used to estimate cluster structure and cluster number . Experimental results show that our algorithm can find important feature subset , estimate model order -LRB- cluster number -RRB- and achieve better performance than another algorithm which requires cluster number to be provided . ##W07-2216 On the Complexity of Non-Projective Data-Driven Dependency Parsing . In this paper we investigate several nonprojective parsing algorithms for dependency parsing , providing novel polynomial time solutions under the assumption that each dependency decision is independent of all the others , called here the edge-factored model . We also investigate algorithms for non-projective parsing that account for nonlocal information , and present several hardness results . This suggests that it is unlikely that exact non-projective dependency parsing is tractable for any model richer than the edge-factored model . ##P05-2008 Using Emoticons To Reduce Dependency In Machine Learning Techniques For Sentiment Classification . Sentiment Classification seeks to identify a piece of text according to its author 's general feeling toward their subject , be it positive or negative . Traditional machine learning techniques have been applied to this problem with reasonable success , but they have been shown to work well only when there is a good match between the training and test data with respect to topic . This paper demonstrates that match with respect to domain and time is also important , and presents preliminary experiments with training data labeled with emoticons , which has the potential of being independent of domain , topic and time . ##P08-1016 Lexicalized Phonotactic Word Segmentation . This paper presents a new unsupervised algorithm -LRB- WordEnds -RRB- for inferring word boundaries from transcribed adult conversations . 
Phone ngrams before and after observed pauses are used to bootstrap a simple discriminative model of boundary marking . This fast algorithm delivers high performance even on morphologically complex words in English and Arabic , and promising results on accurate phonetic transcriptions with extensive pronunciation variation . Expanding training data beyond the traditional miniature datasets pushes performance numbers well above those previously reported . This suggests that WordEnds is a viable model of child language acquisition and might be useful in speech understanding . ##N09-1010 Adding More Languages Improves Unsupervised Multilingual Part-of-Speech Tagging : a Bayesian Non-Parametric Approach . We investigate the problem of unsupervised part-of-speech tagging when raw parallel data is available in a large number of languages . Patterns of ambiguity vary greatly across languages and therefore even unannotated multilingual data can serve as a learning signal . We propose a non-parametric Bayesian model that connects related tagging decisions across languages through the use of multilingual latent variables . Our experiments show that performance improves steadily as the number of languages increases . ##P07-1111 A Re-examination of Machine Learning Approaches for Sentence-Level MT Evaluation . Recent studies suggest that machine learning can be applied to develop good automatic evaluation metrics for machine translated sentences . This paper further analyzes aspects of learning that impact performance . We argue that previously proposed approaches of training a HumanLikeness classifier is not as well correlated with human judgments of translation quality , but that regression-based learning produces more reliable metrics . We demonstrate the feasibility of regression-based metrics through empirical analysis of learning curves and generalization studies and show that they can achieve higher correlations with human judgments than standard automatic metrics . 
##D07-1126 A Multilingual Dependency Analysis System Using Online Passive-Aggressive Learning . This paper presents an online algorithm for dependency parsing problems . We propose an adaptation of the passive and aggressive online learning algorithm to the dependency parsing domain . We evaluate the proposed algorithms on the 2007 CONLL Shared Task , and report an error analysis . Experimental results show that the system score is better than the average score among the participating systems . ##N07-3003 Creating a Knowledge Base from a Collaboratively Generated Encyclopedia . We present our work on using Wikipedia as a knowledge source for Natural Language Processing . We first describe our previous work on computing semantic relatedness from Wikipedia , and its application to a machine learning based coreference resolution system . Our results suggest that Wikipedia represents a semantic resource to be treasured for NLP applications , and accordingly present the work directions to be explored in the future . ##W93-0105 Identifying Unknown Proper Names In Newswire Text . The identification of unknown proper names in text is a significant challenge for NLP systems operating on unrestricted text . A system which indexes documents according to name references can be useful for information retrieval or as a preprocessor for more knowledge intensive tasks such as database extraction . This paper describes a system which uses text skimming techniques for deriving proper names and their semantic attributes automatically from newswire text , without relying on any listing of name elements . In order to identify new names , the system treats proper names as -LRB- potentially -RRB- context-dependent linguistic expressions . 
In addition to using information in the local context , the system exploits a computational model of discourse which identifies individuals based on the way they are described in the text , instead of relying on their description in a pre-existing knowledge base . ##P09-2033 English-Chinese Bi-Directional OOV Translation based on Web Mining and Supervised Learning . In Cross-Language Information Retrieval -LRB- CLIR -RRB- , Out-of-Vocabulary -LRB- OOV -RRB- detection and translation pair relevance evaluation still remain as key problems . In this paper , an English-Chinese Bi-Directional OOV translation model is presented , which utilizes Web mining as the corpus source to collect translation pairs and combines supervised learning to evaluate their association degree . The experimental results show that the proposed model can successfully filter the most possible translation candidate with the lower computational cost , and improve the OOV translation ranking effect , especially for popular new words . ##W08-0302 Rich Source-Side Context for Statistical Machine Translation . We explore the augmentation of statistical machine translation models with features of the context of each phrase to be translated . This work extends several existing threads of research in statistical MT , including the use of context in example-based machine translation -LRB- Carl and Way , 2003 -RRB- and the incorporation of word sense disambiguation into a translation model -LRB- Chan et al. , 2007 -RRB- . The context features we consider use surrounding words and part-of-speech tags , local syntactic structure , and other properties of the source language sentence to help predict each phrase 's translation . Our approach requires very little computation beyond the standard phrase extraction algorithm and scales well to large data scenarios . 
We report significant improvements in automatic evaluation scores for Chinese-to-English and English-to-German translation , and also describe our entry in the WMT-08 shared task based on this approach . ##P06-2032 Coreference Handling In XMG . We claim that existing specification languages for tree based grammars fail to adequately support identifier management . We then show that XMG -LRB- eXtensible MetaGrammar -RRB- provides a sophisticated treatment of identifiers which is effective in supporting a linguist-friendly grammar design . ##N06-1057 ParaEval : Using Paraphrases To Evaluate Summaries Automatically . ParaEval is an automated evaluation method for comparing reference and peer summaries . It facilitates a tiered comparison strategy where recall-oriented global optimal and local greedy searches for paraphrase matching are enabled in the top tiers . We utilize a domain-independent paraphrase table extracted from a large bilingual parallel corpus using methods from Machine Translation -LRB- MT -RRB- . We show that the quality of ParaEval 's evaluations , measured by correlating with human judgments , closely resembles that of ROUGE 's . ##W09-0805 Unsupervised Concept Discovery In Hebrew Using Simple Unsupervised Word Prefix Segmentation for Hebrew and Arabic . Fully unsupervised pattern-based methods for discovery of word categories have been proven to be useful in several languages . The majority of these methods rely on the existence of function words as separate text units . However , in morphology-rich languages , in particular Semitic languages such as Hebrew and Arabic , the equivalents of such function words are usually written as morphemes attached as prefixes to other words . As a result , they are missed by word-based pattern discovery methods , causing many useful patterns to be undetected and a drastic deterioration in performance . 
To enable high quality lexical category acquisition , we propose a simple unsupervised word segmentation algorithm that separates these morphemes . We study the performance of the algorithm for Hebrew and Arabic , and show that it indeed improves a state-of-the-art unsupervised concept acquisition algorithm in Hebrew . ##N07-1065 Automatic Answer Typing for How-Questions . We introduce an answer typing strategy specific to quantifiable how questions . Using the web as a data source , we automatically collect answer units appropriate to a given how-question type . Experimental results show answer typing with these units outperforms traditional fixed-category answer typing and other strategies based on the occurrences of numerical entities in text . ##C08-1117 Using Three Way Data for Word Sense Discrimination . In this paper , an extension of a dimensionality reduction algorithm called NONNEGATIVE MATRIX FACTORIZATION is presented that combines both ` bag of words ' data and syntactic data , in order to find semantic dimensions according to which both words and syntactic relations can be classified . The use of three way data allows one to determine which dimension -LRB- s -RRB- are responsible for a certain sense of a word , and adapt the corresponding feature vector accordingly , ` subtracting ' one sense to discover another one . The intuition in this is that the syntactic features of the syntax-based approach can be disambiguated by the semantic dimensions found by the bag of words approach . The novel approach is embedded into clustering algorithms , to make it fully automatic . The approach is carried out for Dutch , and evaluated against EuroWordNet . ##I08-2131 Language Independent Text Correction using Finite State Automata . Many natural language applications , like machine translation and information extraction , are required to operate on text with spelling errors . 
Those spelling mistakes have to be corrected automatically to avoid deteriorating the performance of such applications . In this work , we introduce a novel approach for automatic correction of spelling mistakes by deploying finite state automata to propose candidate corrections within a specified edit distance from the misspelled word . After choosing candidate corrections , a language model is used to assign scores to the candidate corrections and choose the best correction in the given context . The proposed approach is language independent and requires only a dictionary and text data for building a language model . The approach has been tested on both Arabic and English text and achieved an accuracy of 89 % . ##J85-4001 On The Complexity Of ID\/LP Parsing . Modern linguistic theory attributes surface complexity to interacting subsystems of constraints . For instance , the ID\/LP grammar formalism separates constraints on immediate dominance from those on linear order . An ID\/LP parsing algorithm by Shieber shows how to use ID and LP constraints directly in language processing , without expanding them into an intermediate context-free `` object grammar '' . However , Shieber 's purported runtime bound underestimates the difficulty of ID\/LP parsing . ID\/LP parsing is actually NP-complete , and the worst-case runtime of Shieber 's algorithm is actually exponential in grammar size . The growth of parser data structures causes the difficulty . Some computational and linguistic implications follow ; in particular , it is important to note that , despite its potential for combinatorial explosion , Shieber 's algorithm remains better than the alternative of parsing an expanded object grammar . ##P00-1037 An Improved Error Model For Noisy Channel Spelling Correction . The noisy channel model has been applied to a wide range of problems , including spelling correction . These models consist of two components : a source model and a channel model . 
Very little research has gone into improving the channel model for spelling correction . This paper describes a new channel model for spelling correction , based on generic string to string edits . Using this model gives significant performance improvements compared to previously proposed models . ##W06-0906 Extending TimeML With Typical Durations Of Events . In this paper , we demonstrate how to extend TimeML , a rich specification language for event and temporal expressions in text , with the implicit typical durations of events , temporal information in text that has hitherto been largely unexploited . Event duration information can be very important in applications in which the time course of events is to be extracted from text . For example , whether two events overlap or are in sequence often depends very much on their durations . ##N03-2027 Bayesian Nets For Syntactic Categorization Of Novel Words . This paper presents an application of a Dynamic Bayesian Network -LRB- DBN -RRB- to the task of assigning Part-of-Speech -LRB- PoS -RRB- tags to novel text . This task is particularly challenging for non-standard corpora , such as Internet lingo , where a large proportion of words are unknown . Previous work reveals that PoS tags depend on a variety of morphological and contextual features . Representing these dependencies in a DBN results in an elegant and effective PoS tagger . ##C94-2191 An Integrated Model For Anaphora Resolution . The paper discusses a new knowledge-based and sublanguage-oriented model for anaphora resolution , which integrates syntactic , semantic , discourse , domain and heuristical knowledge for the sublanguage of computer science . Special attention is paid to a new approach for tracking the center throughout a discourse segment , which plays an important role in proposing the most likely antecedent to the anaphor in case of ambiguity . ##P04-1058 Alternative Approaches For Generating Bodies Of Grammar Rules . 
We compare two approaches for describing and generating bodies of rules used for natural language parsing . In today 's parsers rule bodies do not exist a priori but are generated on the fly , usually with methods based on n-grams , which are one particular way of inducing probabilistic regular languages . We compare two approaches for inducing such languages . One is based on n-grams , the other on minimization of the Kullback-Leibler divergence . The inferred regular languages are used for generating bodies of rules inside a parsing procedure . We compare the two approaches along two dimensions : the quality of the probabilistic regular language they produce , and the performance of the parser they were used to build . The second approach outperforms the first one along both dimensions . ##P08-2027 A Unified Syntactic Model for Parsing Fluent and Disfluent Speech . This paper describes a syntactic representation for modeling speech repairs . This representation makes use of a right corner transform of syntax trees to produce a tree representation in which speech repairs require very few special syntax rules , making better use of training data . PCFGs trained on syntax trees using this model achieve high accuracy on the standard Switchboard parsing task . ##C96-1041 Markov Random Field Based English Part-Of-Speech Tagging System . Probabilistic models have been widely used for natural language processing . Part-of-speech tagging , which assigns the most likely tag to each word in a given sentence , is one of the problems which can be solved by a statistical approach . Many researchers have tried to solve the problem by the hidden Markov model -LRB- HMM -RRB- , which is well known as one of the statistical models . But it has many difficulties : integrating heterogeneous information , coping with the data sparseness problem , and adapting to new environments . In this paper , we propose a Markov random field -LRB- MRF -RRB- model based approach to the tagging problem . 
The MRF provides the base frame to combine various statistical information with the maximum entropy -LRB- ME -RRB- method . As the Gibbs distribution can be used to describe the a posteriori probability of tagging , we use it in maximum a posteriori -LRB- MAP -RRB- estimation of the optimizing process . Besides , several tagging models are developed to show the effect of adding information . Experimental results show that the performance of the tagger gets improved as we add more statistical information , and that the MRF-based tagging model is better than the HMM-based tagging model in the data sparseness problem . ##W04-3228 Dependencies Vs. Constituents For Tree-Based Alignment . Given a parallel parsed corpus , statistical tree-to-tree alignment attempts to match nodes in the syntactic trees for a given sentence in two languages . We train a probabilistic tree transduction model on a large automatically parsed Chinese-English corpus , and evaluate results against human-annotated word level alignments . We find that a constituent-based model performs better than a similar probability model trained on the same trees converted to a dependency representation . ##C00-2117 Text Genre Detection Using Common Word Frequencies . In this paper we present a method for detecting the text genre quickly and easily following an approach originally proposed in authorship attribution studies which uses as style markers the frequencies of occurrence of the most frequent words in a training corpus -LRB- Burrows , 1992 -RRB- . In contrast to this approach we use the frequencies of occurrence of the most frequent words of the entire written language . Using as testing ground a part of the Wall Street Journal corpus , we show that the most frequent words of the British National Corpus , representing the most frequent words of the written English language , are more reliable discriminators of text genre in comparison to the most frequent words of the training corpus . 
Moreover , the frequencies of occurrence of the most common punctuation marks play an important role in terms of accurate text categorization as well as when dealing with training data of limited size . ##H91-1049 A Dynamical System Approach To Continuous Speech Recognition . A dynamical system model is proposed for better representing the spectral dynamics of speech for recognition . We assume that the observed feature vectors of a phone segment are the output of a stochastic linear dynamical system and consider two alternative assumptions regarding the relationship of the segment length and the evolution of the dynamics . Training is equivalent to the identification of a stochastic linear system , and we follow a nontraditional approach based on the Estimate-Maximize algorithm . We evaluate this model on a phoneme classification task using the TIMIT database . ##C04-1021 Modern Natural Language Interfaces To Databases : Composing Statistical Parsing With Semantic Tractability . Natural Language Interfaces to Databases -LRB- NLIs -RRB- can benefit from the advances in statistical parsing over the last fifteen years or so . However , statistical parsers require training on a massive , labeled corpus , and manually creating such a corpus for each database is prohibitively expensive . To address this quandary , this paper reports on the PRECISE NLI , which uses a statistical parser as a `` plug in '' . The paper shows how a strong semantic model coupled with `` light re-training '' enables PRECISE to overcome parser errors , and correctly map from parsed questions to the corresponding SQL queries . We discuss the issues in using statistical parsers to build database-independent NLIs , and report on experimental results with the benchmark ATIS data set where PRECISE achieves 94 % accuracy . ##W99-0706 Learning Transformation Rules To Find Grammatical Relations . Grammatical relationships are an important level of natural language processing . 
We present a trainable approach to find these relationships through transformation sequences and error-driven learning . Our approach finds grammatical relationships between core syntax groups and bypasses much of the parsing phase . On our training and test set , our procedure achieves 63.6 % recall and 77.3 % precision -LRB- f-score = 69.8 -RRB- . ##N04-2002 Identifying Chemical Names In Biomedical Text : An Investigation Of Substring Co-Occurrence Based Approaches . We investigate various strategies for finding chemicals in biomedical text using substring co-occurrence information . The goal is to build a system from readily available data with minimal human involvement . Our models are trained from a dictionary of chemical names and general biomedical text . We investigated several strategies including Naïve Bayes classifiers and several types of N-gram models . We introduced a new way of interpolating N-grams that does not require tuning any parameters . We also found the task to be similar to Language Identification . ##A00-2039 Finite-State Reduplication In One-Level Prosodic Morphology . Reduplication , a central instance of prosodic morphology , is particularly challenging for state-of-the-art computational morphology , since it involves copying of some part of a phonological string . In this paper I advocate a finite-state method that combines enriched lexical representations via intersection to implement the copying . The proposal includes a resource-conscious variant of automata and can benefit from the existence of lazy algorithms . Finally , the implementation of a complex case from Koasati is presented . ##P98-2170 A Procedure for Multi-Class Discrimination and some Linguistic Applications . The paper describes a novel computational tool for multiple concept learning . 
Unlike previous approaches , whose major goal is prediction on unseen instances rather than the legibility of the output , our MPD -LRB- Maximally Parsimonious Discrimination -RRB- program emphasizes the conciseness and intelligibility of the resultant class descriptions , using three intuitive simplicity criteria to this end . We illustrate MPD with applications in componential analysis -LRB- in lexicology and phonology -RRB- , language typology , and speech pathology . ##I05-1083 An Empirical Study on Language Model Adaptation Using a Metric of Domain Similarity . This paper presents an empirical study on four techniques of language model adaptation , including a maximum a posteriori -LRB- MAP -RRB- method and three discriminative training models , in the application of Japanese Kana-Kanji conversion . We compare the performance of these methods from various angles by adapting the baseline model to four adaptation domains . In particular , we attempt to interpret the results given in terms of the character error rate -LRB- CER -RRB- by correlating them with the characteristics of the adaptation domain measured using the information-theoretic notion of cross entropy . We show that such a metric correlates well with the CER performance of the adaptation methods , and also show that the discriminative methods are not only superior to a MAP-based method in terms of achieving larger CER reduction , but are also more robust against the similarity of background and adaptation domains . ##N07-2041 Simultaneous Identification of Biomedical Named-Entity and Functional Relation Using Statistical Parsing Techniques . In this paper we propose a statistical parsing technique that simultaneously identifies biomedical named-entities -LRB- NEs -RRB- and extracts subcellular localization relations for bacterial proteins from the text in MEDLINE articles . 
We build a parser that derives both syntactic and domain-dependent semantic information and achieves an F-score of 48.4 % for the relation extraction task . We then propose a semi-supervised approach that incorporates noisy automatically labeled data to improve the F-score of our parser to 83.2 % . Our key contributions are : learning from noisy data , and building an annotated corpus that can benefit relation extraction research . 1 Introduction Relation extraction from text is a step beyond Named-Entity Recognition -LRB- NER -RRB- and generally demands adequate domain knowledge to build relations among domain-specific concepts . A Biomedical Functional Relation -LRB- relation for short -RRB- states interactions among biomedical substances . In this paper we focus on one such relation : Bacterial Protein Localization -LRB- BPL -RRB- , and introduce our approach for identifying BPLs from MEDLINE articles . BPL is a key functional characteristic of proteins . It is essential to the understanding of the function of different proteins and the discovery of suitable drugs , vaccines and diagnostic targets . We are collaborating with researchers in molecular biology with the goal of automatically extracting BPLs from MEDLINE articles . ##W04-1808 Discovering Synonyms And Other Related Words . Discovering synonyms and other related words among the words in a document collection can be seen as a clustering problem , where we expect the words in a cluster to be closely related to one another . The intuition is that words occurring in similar contexts tend to convey similar meaning . We introduce a way to use translation dictionaries for several languages to evaluate the rate of synonymy found in the word clusters . 
We also apply the information radius to calculating similarities between words using a full dependency syntactic feature space , and introduce a method for similarity recalculation during clustering as a fast approximation of the high-dimensional feature space . Finally , we show that 69-79 % of the words in the clusters we discover are useful for thesaurus construction . ##I05-1086 A Case-Based Reasoning Approach for Speech Corpus Generation . Corpus-based stochastic language models have achieved significant success in speech recognition , but construction of a corpus pertaining to a specific application is a difficult task . This paper introduces a Case-Based Reasoning system to generate natural language corpora . In comparison to traditional natural language generation approaches , this system overcomes the inflexibility of template-based methods while avoiding the linguistic sophistication of rule-based packages . The evaluation of the system indicates our approach is effective in generating users ' specifications or queries as 98 % of the generated sentences are grammatically correct . The study result also shows that the language model derived from the generated corpus can significantly outperform a general language model or a dictation grammar . ##W08-0120 A Frame-Based Probabilistic Framework for Spoken Dialog Management Using Dialog Examples . This paper proposes a probabilistic framework for spoken dialog management using dialog examples . To overcome the complexity problems of the classic partially observable Markov decision processes -LRB- POMDPs -RRB- based dialog manager , we use a frame-based belief state representation that reduces the complexity of belief update . We also used dialog examples to maintain a reasonable number of system actions to reduce the complexity of the optimizing policy . We developed weather information and car navigation dialog system that employed a frame-based probabilistic framework . 
This framework enables people to develop a spoken dialog system using a probabilistic approach without the complexity problem of POMDPs . ##C02-1064 Text Generation From Keywords . We describe a method for generating sentences from `` keywords '' or `` headwords '' . This method consists of two main parts , candidate-text construction and evaluation . The construction part generates text sentences in the form of dependency trees by using complementary information to replace information that is missing because of a `` knowledge gap '' and other missing function words to generate natural text sentences based on a particular monolingual corpus . The evaluation part consists of a model for generating an appropriate text when given keywords . This model considers not only word n-gram information , but also dependency information between words . Furthermore , it considers both string information and morphological information . ##P04-1075 Multi-Criteria-Based Active Learning For Named Entity Recognition . In this paper , we propose a multi-criteria based active learning approach and effectively apply it to named entity recognition . Active learning targets to minimize the human annotation efforts by selecting examples for labeling . To maximize the contribution of the selected examples , we consider the multiple criteria : informativeness , representativeness and diversity and propose measures to quantify them . More comprehensively , we incorporate all the criteria using two selection strategies , both of which result in less labeling cost than a single-criterion-based method . The results of the named entity recognition in both MUC-6 and GENIA show that the labeling cost can be reduced by at least 80 % without degrading the performance . ##P06-2071 Discriminating Image Senses By Clustering With Multimodal Features . 
We discuss Image Sense Discrimination -LRB- ISD -RRB- , and apply a method based on spectral clustering , using multimodal features from the image and text of the embedding web page . We evaluate our method on a new data set of annotated web images , retrieved with ambiguous query terms . Experiments investigate different levels of sense granularity , as well as the impact of text and image features , and global versus local text features . ##P98-1066 A Layered Approach to NLP-Based Information Retrieval . A layered approach to information retrieval permits the inclusion of multiple search engines as well as multiple databases , with a natural language layer to convert English queries for use by the various search engines . The NLP layer incorporates morphological analysis , noun phrase syntax , and semantic expansion based on WordNet . ##P07-2049 Measuring Importance and Query Relevance in Topic-focused Multi-document Summarization . The increasing complexity of summarization systems makes it difficult to analyze exactly which modules make a difference in performance . We carried out a principled comparison between the two most commonly used schemes for assigning importance to words in the context of query focused multi-document summarization : raw frequency -LRB- word probability -RRB- and log-likelihood ratio . We demonstrate that the advantages of log-likelihood ratio come from its known distributional properties which allow for the identification of a set of words that in its entirety defines the aboutness of the input . We also find that LLR is more suitable for query-focused summarization since , unlike raw frequency , it is more sensitive to the integration of the information need defined by the user . ##W97-1202 Message-To-Speech : High Quality Speech Generation For Messaging And Dialogue Systems . 
In this paper , we present a Message-to-Speech -LRB- MTS -RRB- system that offers the linguistic flexibility desired for spoken dialogue and message generating systems . The use of prosody transplantation and special purpose prosody models results in highly natural prosody for the synthesised speech . ##P04-1088 FLSA : Extending Latent Semantic Analysis With Features For Dialogue Act Classification . We discuss Feature Latent Semantic Analysis -LRB- FLSA -RRB- , an extension to Latent Semantic Analysis -LRB- LSA -RRB- . LSA is a statistical method that is ordinarily trained on words only ; FLSA adds to LSA the richness of the many other linguistic features that a corpus may be labeled with . We applied FLSA to dialogue act classification with excellent results . We report results on three corpora : CallHome Spanish , MapTask , and our own corpus of tutoring dialogues . ##P04-1014 Parsing The WSJ Using CCG And Log-Linear Models . This paper describes and evaluates log-linear parsing models for Combinatory Categorial Grammar -LRB- CCG -RRB- . A parallel implementation of the L-BFGS optimisation algorithm is described , which runs on a Beowulf cluster allowing the complete Penn Treebank to be used for estimation . We also develop a new efficient parsing algorithm for CCG which maximises expected recall of dependencies . We compare models which use all CCG derivations , including nonstandard derivations , with normal-form models . The performances of the two models are comparable and the results are competitive with existing wide-coverage CCG parsers . ##C02-1070 Inducing Information Extraction Systems For New Languages Via Cross-Language Projection . Information extraction -LRB- IE -RRB- systems are costly to build because they require development texts , parsing tools , and specialized dictionaries for each application domain and each natural language that needs to be processed . 
We present a novel method for rapidly creating IE systems for new languages by exploiting existing IE systems via cross-language projection . Given an IE system for a source language -LRB- e.g. , English -RRB- , we can transfer its annotations to corresponding texts in a target language -LRB- e.g. , French -RRB- and learn information extraction rules for the new language automatically . In this paper , we explore several ways of realizing both the transfer and learning processes using off-the-shelf machine translation systems , induced word alignment , attribute projection , and transformation-based learning . We present a variety of experiments that show how an English IE system for a plane crash domain can be leveraged to automatically create a French IE system for the same domain . ##W09-0203 Unsupervised Classification with Dependency Based Word Spaces . We present the results of clustering experiments with a number of different evaluation sets using dependency based word spaces . Contrary to previous results we found a clear advantage using a parsed corpus over word spaces constructed with the help of simple patterns . We achieve considerable gains in performance over these spaces ranging between 9 and 13 % in absolute terms of cluster purity . ##D07-1102 Log-Linear Models of Non-Projective Trees , $ k $ - best MST Parsing and Tree-Ranking . We present our system used in the CoNLL 2007 shared task on multilingual parsing . The system is composed of three components : a k-best maximum spanning tree -LRB- MST -RRB- parser , a tree labeler , and a reranker that orders the k-best labeled trees . We present two techniques for training the MST parser : tree-normalized and graph-normalized conditional training . The tree-based reranking model allows us to explicitly model global syntactic phenomena . We describe the reranker features which include non-projective edge attributes . 
We provide an analysis of the errors made by our system and suggest changes to the models and features that might rectify the current system . ##I08-1056 Cluster-Based Query Expansion for Statistical Question Answering . Document retrieval is a critical component of question answering -LRB- QA -RRB- , yet little work has been done towards statistical modeling of queries and towards automatic generation of high quality query content for QA . This paper introduces a new , cluster-based query expansion method that learns queries known to be successful when applied to similar questions . We show that cluster-based expansion improves the retrieval performance of a statistical question answering system when used in addition to existing query expansion methods . This paper presents experiments with several feature selection methods used individually and in combination . We show that documents retrieved using the cluster-based approach are inherently different than documents retrieved using existing methods and provide a higher data diversity to answers extractors . ##W09-0206 Positioning for Conceptual Development using Latent Semantic Analysis . With increasing opportunities to learn online , the problem of positioning learners in an educational network of content offers new possibilities for the utilisation of geometry-based natural language processing techniques . In this article , the adoption of latent semantic analysis -LRB- LSA -RRB- for guiding learners in their conceptual development is investigated . We propose five new algorithmic derivations of LSA and test their validity for positioning in an experiment in order to draw back conclusions on the suitability of machine learning from previously accredited evidence . Special attention is thereby directed towards the role of distractors and the calculation of thresholds when using similarities as a proxy for assessing conceptual closeness . Results indicate that learning improves positioning . 
Distractors are of low value and seem to be replaceable by generic noise to improve threshold calculation . Furthermore , new ways to flexibly calculate thresholds could be identified . ##P06-1079 Exploiting Syntactic Patterns As Clues In Zero-Anaphora Resolution . We approach the zero-anaphora resolution problem by decomposing it into intra-sentential and inter-sentential zero-anaphora resolution . For the former problem , syntactic patterns of the appearance of zero-pronouns and their antecedents are useful clues . Taking Japanese as a target language , we empirically demonstrate that incorporating rich syntactic pattern features in a state-of-the-art learning-based anaphora resolution model dramatically improves the accuracy of intra-sentential zero-anaphora , which consequently improves the overall performance of zero-anaphora resolution . ##I08-3020 Speech to speech machine translation : Biblical chatter from Finnish to English . Speech-to-speech machine translation is in some ways the peak of natural language processing , in that it deals directly with our original , oral mode of communication -LRB- as opposed to derived written language -RRB- . As such , it presents challenges that are not to be taken lightly . Although existing technology covers each of the steps in the process , from speech recognition to synthesis , deriving a model of translation that is effective in the domain of spoken language is an interesting and challenging task . If we could teach our algorithms to learn as children acquire language , the result would be useful both for language technology and cognitive science . We propose several potential approaches , an implementation of a multi-path model that translates recognized morphemes alongside words , and a web-interface to test our speech translation tool as trained for Finnish to English . 
We also discuss current approaches to machine translation and the problems they face in adapting simultaneously to morphologically rich languages and to the spoken modality . ##C92-1027 Compiling And Using Finite-State Syntactic Rules . A language-independent framework for syntactic finite-state parsing is discussed . The article presents a framework , a formalism , a compiler and a parser for grammars written in this formalism . As a substantial example , fragments from a nontrivial finite-state grammar of English are discussed . The linguistic framework of the present approach is based on a surface syntactic tagging scheme by F. Karlsson . This representation is slightly less powerful than phrase structure tree notation , letting some ambiguous constructions be described more concisely . The finite-state rule compiler implements what was briefly sketched by Koskenniemi -LRB- 1990 -RRB- . It is based on the calculus of finite-state machines . The compiler transforms rules into rule-automata . The run-time parser exploits one of certain alternative strategies in performing the effective intersection of the rule automata and the sentence automaton . Fragments of a fairly comprehensive finite-state grammar of English are presented here , including samples from non-finite constructions as a demonstration of the capacity of the present formalism , which goes far beyond plain disambiguation or part of speech tagging . The grammar itself is directly related to a parser and tagging system for English created as a part of project SIMPR using Karlsson 's CG -LRB- Constraint Grammar -RRB- formalism . ##I05-1018 Adapting a Probabilistic Disambiguation Model of an HPSG Parser to a New Domain . This paper describes a method of adapting a domain-independent HPSG parser to a biomedical domain . Without modifying the grammar and the probabilistic model of the original HPSG parser , we develop a log-linear model with additional features on a treebank of the biomedical domain . 
Since the treebank of the target domain is limited , we need to exploit an original disambiguation model that was trained on a larger treebank . Our model incorporates the original model as a reference probabilistic distribution . The experimental results for our model trained with a small amount of a treebank demonstrated an improvement in parsing accuracy . ##C00-1002 Learning Word Clusters From Data Types . The paper illustrates a linguistic knowledge acquisition model making use of data types , infinite memory , and an inferential mechanism for inducing new information from known data . The model is compared with standard stochastic methods applied to data tokens , and tested on a task of lexico-semantic classification . ##N03-2029 Automatic Derivation Of Surface Text Patterns For A Maximum Entropy Based Question Answering System . In this paper we investigate the use of surface text patterns for a Maximum Entropy based Question Answering -LRB- QA -RRB- system . These text patterns are collected automatically in an unsupervised fashion using a collection of trivia question and answer pairs as seeds . These patterns are used to generate features for a statistical question answering system . We report our results on the TREC-10 question set . ##W97-0312 Learning To Tag Multilingual Texts Through Observation . This paper describes RoboTag , an advanced prototype for a machine learning-based multilingual information extraction system . First , we describe a general client\/server architecture used in learning from observation . Then we give a detailed description of our novel decision-tree tagging approach . RoboTag performance for the proper noun tagging task in English and Japanese is compared against human-tagged keys and to the best hand-coded pattern performance -LRB- as reported in the MUC and MET evaluation results -RRB- . Related work and future directions are presented . ##N04-1017 Lattice-Based Search For Spoken Utterance Retrieval . 
Recent work on spoken document retrieval has suggested that it is adequate to take the single-best output of ASR , and perform text retrieval on this output . This is reasonable enough for the task of retrieving broadcast news stories , where word error rates are relatively low , and the stories are long enough to contain much redundancy . But it is patently not reasonable if one 's task is to retrieve a short snippet of speech in a domain where WERs can be as high as 50 % ; such would be the situation with teleconference speech , where one 's task is to find if and when a participant uttered a certain phrase . In this paper we propose an indexing procedure for spoken utterance retrieval that works on lattices rather than just single-best text . We demonstrate that this procedure can improve F scores by over five points compared to single-best retrieval on tasks with poor WER and low redundancy . The representation is flexible so that we can represent both word lattices , as well as phone lattices , the latter being important for improving performance when searching for phrases containing OOV words . ##W03-1308 Bio-Medical Entity Extraction Using Support Vector Machines . Support Vector Machines have achieved state of the art performance in several classification tasks . In this article we apply them to the identification and semantic annotation of scientific and technical terminology in the domain of molecular biology . This illustrates the extensibility of the traditional named entity task to special domains with extensive terminologies such as those in medicine and related disciplines . We illustrate SVM 's capabilities using a sample of 100 journal abstract texts taken from the human , blood cell , transcription factor domain of MEDLINE . Approximately 3400 terms are annotated and the model performs at about 74 % F-score on cross-validation tests . A detailed analysis based on empirical evidence shows the contribution of various feature sets to performance . 
##P07-1119 Substring-Based Transliteration . Transliteration is the task of converting a word from one alphabetic script to another . We present a novel , substring-based approach to transliteration , inspired by phrase-based models of machine translation . We investigate two implementations of substring-based transliteration : a dynamic programming algorithm , and a finite-state transducer . We show that our substring-based transducer not only outperforms a state-of-the-art letter-based approach by a significant margin , but is also orders of magnitude faster . ##W09-2103 Inferring Tutorial Dialogue Structure with Hidden Markov Modeling . The field of intelligent tutoring systems has seen many successes in recent years . A significant remaining challenge is the automatic creation of corpus-based tutorial dialogue management models . This paper reports on early work toward this goal . We identify tutorial dialogue modes in an unsupervised fashion using hidden Markov models -LRB- HMMs -RRB- trained on input sequences of manually-labeled dialogue acts and adjacency pairs . The two best-fit HMMs are presented and compared with respect to the dialogue structure they suggest ; we also discuss potential uses of the methodology for future work . ##W09-1106 Efficient Linearization of Tree Kernel Functions . The combination of Support Vector Machines with very high dimensional kernels , such as string or tree kernels , suffers from two major drawbacks : first , the implicit representation of feature spaces does not allow us to understand which features actually triggered the generalization ; second , the resulting computational burden may in some cases make it unfeasible to use large data sets for training . We propose an approach based on feature space reverse engineering to tackle both problems . 
Our experiments with Tree Kernels on a Semantic Role Labeling data set show that the proposed approach can drastically reduce the computational footprint while yielding almost unaffected accuracy . ##P08-1090 Unsupervised Learning of Narrative Event Chains . Hand-coded scripts were used in the 1970-80s as knowledge backbones that enabled inference and other NLP tasks requiring deep semantic knowledge . We propose unsupervised induction of similar schemata called narrative event chains from raw newswire text . A narrative event chain is a partially ordered set of events related by a common protagonist . We describe a three step process to learning narrative event chains . The first uses unsupervised distributional methods to learn narrative relations between events sharing coreferring arguments . The second applies a temporal classifier to partially order the connected events . Finally , the third prunes and clusters self-contained chains from the space of events . We introduce two evaluations : the narrative cloze to evaluate event relatedness , and an order coherence task to evaluate narrative order . We show a 36 % improvement over baseline for narrative prediction and 25 % for temporal coherence . ##W02-1106 Translating Lexical Semantic Relations : The First Step Towards Multilingual Wordnets . Establishing correspondences between wordnets of different languages is essential to both multilingual knowledge processing and for bootstrapping wordnets of low-density languages . We claim that such correspondences must be based on lexical semantic relations , rather than top ontology or word translations . In particular , we define a translation equivalence relation as a bilingual lexical semantic relation . Such relations can then be part of a logical entailment predicting whether source language semantic relations will hold in a target language or not . 
Our claim is tested with a study of 210 Chinese lexical lemmas and their possible semantic relation links bootstrapped from the Princeton WordNet . The results show that lexical semantic relation translations are indeed highly precise when they are logically inferable . ##A00-3005 Corpus-Based Syntactic Error Detection Using Syntactic Patterns . This paper presents a parsing system for the detection of syntactic errors . It combines a robust partial parser which obtains the main sentence components and a finite-state parser used for the description of syntactic error patterns . The system has been tested on a corpus of real texts , containing both correct and incorrect sentences , with promising results . ##W09-2206 Latent Dirichlet Allocation with Topic-in-Set Knowledge . Latent Dirichlet Allocation is an unsupervised graphical model which can discover latent topics in unlabeled data . We propose a mechanism for adding partial supervision , called topic-in-set knowledge , to latent topic modeling . This type of supervision can be used to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise . Preliminary experiments on text datasets are presented to demonstrate the potential effectiveness of this method . ##D08-1073 Jointly Combining Implicit Constraints Improves Temporal Ordering . Previous work on ordering events in text has typically focused on local pairwise decisions , ignoring globally inconsistent labels . However , temporal ordering is the type of domain in which global constraints should be relatively easy to represent and reason over . This paper presents a framework that informs local decisions with two types of implicit global constraints : transitivity -LRB- A before B and B before C implies A before C -RRB- and time expression normalization -LRB- e.g. last month is before yesterday -RRB- . 
We show how these constraints can be used to create a more densely-connected network of events , and how global consistency can be enforced by incorporating these constraints into an integer linear programming framework . We present results on two event ordering tasks , showing a 3.6 % absolute increase in the accuracy of before\/after classification over a pairwise model . ##W06-2607 Tree Kernel Engineering In Semantic Role Labeling Systems . Recent work on the design of automatic systems for semantic role labeling has shown that feature engineering is a complex task from a modeling and implementation point of view . Tree kernels alleviate such complexity as kernel functions generate features automatically and require less software development for data extraction . In this paper , we study several tree kernel approaches for both boundary detection and argument classification . The comparative experiments on Support Vector Machines with such kernels on the CoNLL 2005 dataset show that very simple tree manipulations trigger automatic feature engineering that highly improves accuracy and efficiency in both phases . Moreover , the use of different classifiers for internal and pre-terminal nodes maintains the same accuracy and highly improves efficiency . ##W04-1119 A Semi-Supervised Approach To Build Annotated Corpus For Chinese Named Entity Recognition . This paper presents a semi-supervised approach to reduce human effort in building an annotated Chinese corpus . One of the disadvantages of many statistical Chinese named entity recognition systems is that training data may be in short supply , and manually building annotated corpus is expensive . In the proposed approach , we construct an 80M hand-annotated corpus in three steps : -LRB- 1 -RRB- Automatically annotate training corpus ; -LRB- 2 -RRB- Manually refine small subsets of the automatically annotated corpus ; -LRB- 3 -RRB- Combine small subsets and whole corpus in a bootstrapping process . 
Our approach is tested on a state-of-the-art Chinese word segmentation system -LRB- Gao et al. , 2003 , 2004 -RRB- . Experiments show that only a small subset of hand-annotated corpus is sufficient to achieve a satisfying performance of the named entity component in this system . ##I08-5014 Named Entity Recognition for Indian Languages . This paper talks about a new approach to recognize named entities for Indian languages . Phonetic matching technique is used to match the strings of different languages on the basis of their similar sounding property . We have tested our system with a comparable corpus of English and Hindi language data . This approach is language independent and requires only a set of rules appropriate for a language . ##C88-2118 Parsing Noisy Sentences . This paper describes a method to parse and understand a `` noisy '' sentence that possibly includes errors caused by a speech recognition device . Our parser is connected to a speech recognition device which takes a continuously spoken sentence in Japanese and produces a sequence of phonemes . The output sequence of phonemes can quite possibly include errors : altered phonemes , extra phonemes and missing phonemes . The task is to parse the noisy phoneme sequence and understand the meaning of the original input sentence , given an augmented context-free grammar whose terminal symbols are phonemes . A very efficient parsing method is required , as the task 's search space is much larger than that of parsing un-noisy sentences . We adopt the generalized LR parsing algorithm , and a certain scoring scheme to select the most likely sentence out of multiple sentence candidates . The use of a confusion matrix , which is created in advance by analyzing a large set of input\/output pairs , is discussed to improve the scoring accuracy . The system has been integrated into CMU 's knowledge-based machine translation system . ##C08-1110 A Framework for Identifying Textual Redundancy . 
The task of identifying redundant information in documents that are generated from multiple sources provides a significant challenge for summarization and QA systems . Traditional clustering techniques detect redundancy at the sentential level and do not guarantee the preservation of all information within the document . We discuss an algorithm that generates a novel graph-based representation for a document and then utilizes a set cover approximation algorithm to remove redundant text from it . Our experiments show that this approach offers a significant performance advantage over clustering when evaluated over an annotated dataset . ##J87-3003 Tools And Methods For Computational Lexicology . This paper presents a set of tools and methods for acquiring , manipulating , and analyzing machine-readable dictionaries . We give several detailed examples of the use of these tools and methods for particular analyses . A novel aspect of our work is that it allows the combined processing of multiple machine-readable dictionaries . Our examples describe analyses of data from Webster 's Seventh Collegiate Dictionary , the Longman Dictionary of Contemporary English , the Collins bilingual dictionaries , the Collins Thesaurus , and the Zingarelli Italian dictionary . We describe existing facilities and results they have produced as well as planned enhancements to those facilities , particularly in the area of managing associations involving the senses of polysemous words . We show how these enhancements expand the ways in which we can exploit machine-readable dictionaries in the construction of large lexicons for natural language processing systems . ##C96-2213 Using A Hybrid System Of Corpus - And Knowledge-Based Techniques To Automate The Induction Of A Lexical Sublanguage Grammar . 
Porting a Natural Language Processing -LRB- NLP -RRB- system to a new domain remains one of the bottlenecks in syntactic parsing , because of the amount of effort required to fix gaps in the lexicon , and to attune the existing grammar to the idiosyncrasies of the new sublanguage . This paper shows how the process of fitting a lexicalized grammar to a domain can be automated to a great extent by using a hybrid system that combines traditional knowledge-based techniques with a corpus-based approach . ##P09-3011 Clustering Technique in Multi-Document Personal Name Disambiguation . Focusing on multi-document personal name disambiguation , this paper develops an agglomerative clustering approach to resolving this problem . We start from an analysis of pointwise mutual information between feature and the ambiguous name , which brings about a novel weight computing method for feature in clustering . Then a trade-off measure between within-cluster compactness and among-cluster separation is proposed for stopping clustering . After that , we apply a labeling method to find representative feature for each cluster . Finally , experiments are conducted on word-based clustering in Chinese dataset and the result shows a good effect . ##D07-1095 Inducing Search Keys for Name Filtering . This paper describes ETK -LRB- Ensemble of Transformation-based Keys -RRB- , a new algorithm for inducing search keys for name filtering . ETK has the low computational cost and ability to filter by phonetic similarity characteristic of phonetic keys such as Soundex , but is adaptable to alternative similarity models . The accuracy of ETK in a preliminary empirical evaluation suggests that it is well-suited for phonetic filtering applications such as recognizing alternative cross-lingual transliterations . ##I08-2138 A Punjabi Grammar Checker . 
This article provides a description of the grammar checking software developed for detecting grammatical errors in Punjabi texts and providing suggestions wherever appropriate to rectify those errors . This system utilizes a full-form lexicon for morphology analysis and rule-based systems for part of speech tagging and phrase chunking . The system supported by a set of carefully devised error detection rules can detect and suggest rectifications for a number of grammatical errors , resulting from lack of agreement , order of words in various phrases etc. , in literary style Punjabi texts . ##P06-2108 Using Word Support Model To Improve Chinese Input System . This paper presents a word support model -LRB- WSM -RRB- . The WSM can effectively perform homophone selection and syllable-word segmentation to improve Chinese input systems . The experimental results show that : -LRB- 1 -RRB- the WSM is able to achieve tonal -LRB- syllables input with four tones -RRB- and toneless -LRB- syllables input without four tones -RRB- syllable-to-word -LRB- STW -RRB- accuracies of 99 % and 92 % , respectively , among the converted words ; and -LRB- 2 -RRB- while applying the WSM as an adaptation processing , together with the Microsoft Input Method Editor 2003 -LRB- MSIME -RRB- and an optimized bigram model , the average tonal and toneless STW improvements are 37 % and 35 % , respectively . ##W09-2201 Coupling Semi-Supervised Learning of Categories and Relations . We consider semi-supervised learning of information extraction methods , especially for extracting instances of noun categories -LRB- e.g. , ` athlete , ' ` team ' -RRB- and relations -LRB- e.g. , ` playsForTeam -LRB- athlete , team -RRB- ' -RRB- . Semi-supervised approaches using a small number of labeled examples together with many unlabeled examples are often unreliable as they frequently produce an internally consistent , but nevertheless incorrect set of extractions . 
We propose that this problem can be overcome by simultaneously learning classifiers for many different categories and relations in the presence of an ontology defining constraints that couple the training of these classifiers . Experimental results show that simultaneously learning a coupled collection of classifiers for 30 categories and relations results in much more accurate extractions than training classifiers individually . ##W93-0110 Acquiring Predicate-Argument Mapping Information From Multilingual Texts . This paper discusses automatic acquisition of predicate-argument mapping information from multilingual texts . The lexicon of our NLP system abstracts the language-dependent portion of predicate-argument mapping information from the core meaning of verb senses -LRB- i.e. semantic concepts as defined in the knowledge base -RRB- . We represent this mapping information in terms of cross-linguistically generalized mapping types called situation types and word sense-specific idiosyncrasies . This representation has enabled us to automatically acquire predicate-argument mapping information , specifically situation types and idiosyncrasies , for verbs in English , Spanish , and Japanese texts . ##W04-3007 Robustness Issues In A Data-Driven Spoken Language Understanding System . Robustness is a key requirement in spoken language understanding -LRB- SLU -RRB- systems . Human speech is often ungrammatical and ill-formed , and there will frequently be a mismatch between training and test data . This paper discusses robustness and adaptation issues in a statistically-based SLU system which is entirely data-driven . To test robustness , the system has been tested on data from the Air Travel Information Service -LRB- ATIS -RRB- domain which has been artificially corrupted with varying levels of additive noise . Although the speech recognition performance degraded steadily , the system did not fail catastrophically . 
Indeed , the rate at which the end-to-end performance of the complete system degraded was significantly slower than that of the actual recognition component . In a second set of experiments , the ability to rapidly adapt the core understanding component of the system to a different application within the same broad domain has been tested . Using only a small amount of training data , experiments have shown that a semantic parser based on the Hidden Vector State -LRB- HVS -RRB- model originally trained on the ATIS corpus can be straightforwardly adapted to the somewhat different DARPA Communicator task using standard adaptation algorithms . The paper concludes by suggesting that the results presented provide initial support to the claim that an SLU system which is statistically-based and trained entirely from data is intrinsically robust and can be readily adapted to new applications . ##I05-1013 Automatic Partial Parsing Rule Acquisition Using Decision Tree Induction . Partial parsing techniques try to recover syntactic information efficiently and reliably by sacrificing completeness and depth of analysis . One of the difficulties of partial parsing is finding a means to extract the grammar involved automatically . In this paper , we present a method for automatically extracting partial parsing rules from a tree-annotated corpus using decision tree induction . We define the partial parsing rules as those that can decide the structure of a substring in an input sentence deterministically . This decision can be considered as a classification ; as such , for a substring in an input sentence , a proper structure is chosen among the structures that occurred in the corpus . For the classification , we use decision tree induction , and induce partial parsing rules from the decision tree . The acquired grammar is similar to a phrase structure grammar , with contextual and lexical information , but it allows building structures of depth one or more . 
Our experiments showed that the proposed partial parser using the automatically extracted rules is not only accurate and efficient , but also achieves reasonable coverage for Korean . ##W97-0811 An Experiment In Semantic Tagging Using Hidden Markov Model Tagging . The same word can have many different meanings depending on the context in which it is used . Discovering the meaning of a word , given the text around it , has been an interesting problem for both the psychology and the artificial intelligence research communities . In this article , we present a series of experiments , using methods which have proven to be useful for eliminating part-of-speech ambiguity , to see if such simple methods can be used to resolve semantic ambiguities . Using a publicly available semantic lexicon , we find the Hidden Markov Models work surprisingly well at choosing the right semantic categories , once the sentence has been stripped of purely functional words . ##E06-1002 Using Encyclopedic Knowledge For Named Entity Disambiguation . We present a new method for detecting and disambiguating named entities in open domain text . A disambiguation SVM kernel is trained to exploit the high coverage and rich structure of the knowledge encoded in an online encyclopedia . The resulting model significantly outperforms a less informed baseline . ##N04-3006 Open Text Semantic Parsing Using FrameNet And WordNet . This paper describes a rule-based semantic parser that relies on a frame dataset -LRB- FrameNet -RRB- , and a semantic network -LRB- WordNet -RRB- , to identify semantic relations between words in open text , as well as shallow semantic features associated with concepts in the text . Parsing semantic structures allows semantic units and constituents to be accessed and processed in a more meaningful way than syntactic parsing , moving the automation of understanding natural language text to a higher level . 
##P06-2107 Statistical Phrase-Based Models For Interactive Computer-Assisted Translation . Obtaining high-quality machine translations is still a long way off . A post-editing phase is required to improve the output of a machine translation system . An alternative is the so-called computer-assisted translation . In this framework , a human translator interacts with the system in order to obtain high-quality translations . A statistical phrase-based approach to computer-assisted translation is described in this article . A new decoder algorithm for interactive search is also presented , that combines monotone and nonmonotone search . The system has been assessed in the TransType-2 project for the translation of several printer manuals , from -LRB- to -RRB- English to -LRB- from -RRB- Spanish , German and French . ##P98-1068 Japanese Morphological Analyzer using Word Co-occurrence - JTAG . We developed a Japanese morphological analyzer that uses the co-occurrence of words to select the correct sequence of words in an unsegmented Japanese sentence . The co-occurrence information can be obtained from cases where the system incorrectly analyzes sentences . As the amount of information increases , the accuracy of the system increases with a small risk of degradation . Experimental results show that the proposed system assigns the correct phonological representations to unsegmented Japanese sentences more precisely than do other popular systems . ##D07-1082 Active Learning for Word Sense Disambiguation with Methods for Addressing the Class Imbalance Problem . In this paper , we analyze the effect of resampling techniques , including undersampling and over-sampling used in active learning for word sense disambiguation -LRB- WSD -RRB- . Experimental results show that under-sampling causes negative effects on active learning , but over-sampling is a relatively good choice . 
To alleviate the within-class imbalance problem of over-sampling , we propose a bootstrap-based over-sampling -LRB- BootOS -RRB- method that works better than ordinary over-sampling in active learning for WSD . Finally , we investigate when to stop active learning , and adopt two strategies , max-confidence and min-error , as stopping conditions for active learning . According to experimental results , we suggest a prediction solution by considering max-confidence as the upper bound and min-error as the lower bound for stopping conditions . ##P98-2138 Combining Trigram and Winnow in Thai OCR Error Correction . For languages that have no explicit word boundary such as Thai , Chinese and Japanese , correcting words in text is harder than in English because of additional ambiguities in locating error words . The traditional method handles this by hypothesizing that every substring in the input sentence could be an error word and trying to correct all of them . In this paper , we propose the idea of reducing the scope of spelling correction by focusing only on dubious areas in the input sentence . Boundaries of these dubious areas could be obtained approximately by applying a word segmentation algorithm and finding word sequences with low probability . To generate the candidate correction words , we used a modified edit distance which reflects the characteristic of Thai OCR errors . Finally , a part-of-speech trigram model and Winnow algorithm are combined to determine the most probable correction . ##E06-1049 A Machine Learning Approach To Extract Temporal Information From Texts In Swedish And Generate Animated 3D Scenes . Carsim is a program that automatically converts narratives into 3D scenes . Carsim considers authentic texts describing road accidents , generally collected from web sites of Swedish newspapers or transcribed from hand-written accounts by victims of accidents . One of the program 's key features is that it animates the generated scene to visualize events . 
To create a consistent animation , Carsim extracts the participants mentioned in a text and identifies what they do . In this paper , we focus on the extraction of temporal relations between actions . We first describe how we detect time expressions and events . We then present a machine learning technique to order the sequence of events identified in the narratives . We finally report the results we obtained . ##W00-0309 Task-Based Dialog Management Using An Agenda . Dialog management addresses two specific problems : -LRB- 1 -RRB- providing a coherent overall structure to interaction that extends beyond the single turn , -LRB- 2 -RRB- correctly managing mixed-initiative interaction . We propose a dialog management architecture based on the following elements : handlers that manage interaction focussed on tightly coupled sets of information , a product that reflects mutually agreed-upon information and an agenda that orders the topics relevant to task completion . ##W02-0713 Sharing Problems And Solutions For Machine Translation Of Spoken And Written Interaction . Examples from chat interaction are presented to demonstrate that machine translation of written interaction shares many problems with translation of spoken interaction . The potential for common solutions to the problems is illustrated by describing operations that normalize and tag input before translation . Segmenting utterances into small translation units and processing short turns separately are also motivated using data from chat . ##C96-2204 Constructing Verb Semantic Classes For French : Methods And Evaluation . In this paper , we study a reformulation , which is better adapted to NLP , of the alternation system developed for English by B. Levin . We have studied a set of 1700 verbs from which we explain how verb semantic classes can be built in a systematic way . The quality of the results w.r.t. semantic classifications such as WordNet is then evaluated . 
##P02-1060 Named Entity Recognition Using An HMM-Based Chunk Tagger . This paper proposes a Hidden Markov Model -LRB- HMM -RRB- and an HMM-based chunk tagger , from which a named entity -LRB- NE -RRB- recognition -LRB- NER -RRB- system is built to recognize and classify names , times and numerical quantities . Through the HMM , our system is able to apply and integrate four types of internal and external evidences : 1 -RRB- simple deterministic internal feature of the words , such as capitalization and digitalization ; 2 -RRB- internal semantic feature of important triggers ; 3 -RRB- internal gazetteer feature ; 4 -RRB- external macro context feature . In this way , the NER problem can be resolved effectively . Evaluation of our system on MUC-6 and MUC-7 English NE tasks achieves F-measures of 96.6 % and 94.1 % respectively . It shows that the performance is significantly better than reported by any other machine-learning system . Moreover , the performance is even consistently better than those based on handcrafted rules . ##W98-1415 Clause Aggregation Using Linguistic Knowledge . By combining multiple clauses into one single sentence , a text generation system can express the same amount of information in fewer words and at the same time , produce a great variety of complex constructions . In this paper , we describe hypotactic and paratactic operators for generating complex sentences from clause-sized semantic representations . These two types of operators are portable and reusable because they are based on general resources such as the lexicon and the grammar . ##W08-0403 Prior Derivation Models For Formally Syntax-Based Translation Using Linguistically Syntactic Parsing and Tree Kernels . This paper presents an improved formally syntax-based SMT model , which is enriched by linguistically syntactic knowledge obtained from statistical constituent parsers . 
We propose a linguistically-motivated prior derivation model to score hypothesis derivations on top of the baseline model during the translation decoding . Moreover , we devise a fast training algorithm to achieve such improved models based on tree kernel methods . Experiments on an English-to-Chinese task demonstrate that our proposed models outperformed the baseline formally syntax-based models , while both of them achieved significant improvements over a state-of-the-art phrase-based SMT system . ##W03-0416 An Efficient Clustering Algorithm For Class-Based Language Models . This paper defines a general form for class-based probabilistic language models and proposes an efficient algorithm for clustering based on this . Our evaluation experiments revealed that our method decreased computation time drastically , while retaining accuracy . ##X98-1016 Transforming Examples Into Patterns For Information Extraction . Information Extraction -LRB- IE -RRB- systems today are commonly based on pattern matching . The patterns are regular expressions stored in a customizable knowledge base . Adapting an IE system to a new subject domain entails the construction of a new pattern base - a time-consuming and expensive task . We describe a strategy for building patterns from examples . To adapt the IE system to a new domain quickly , the user chooses a set of examples in a training text , and for each example gives the logical form entries which the example induces . The system transforms these examples into patterns and then applies meta-rules to generalize these patterns . ##C08-2011 Word Sense Disambiguation for All Words using Tree-Structured Conditional Random Fields . We propose a supervised word sense disambiguation -LRB- WSD -RRB- method using tree-structured conditional random fields -LRB- TCRFs -RRB- . By applying TCRFs to a sentence described as a dependency tree structure , we conduct WSD as a labeling problem on tree structures . 
To incorporate dependencies between word senses , we introduce a set of features on tree edges , in combination with coarse-grained tagsets , and show that these contribute to an improvement in WSD accuracy . We also show that the tree-structured model outperforms the linear-chain model . Experiments on the SENSEVAL-3 data set show that our TCRF model performs comparably with state-of-the-art WSD systems . ##D09-1084 A Relational Model of Semantic Similarity between Words using Automatically Extracted Lexical Pattern Clusters from the Web . Semantic similarity is a central concept that extends across numerous fields such as artificial intelligence , natural language processing , cognitive science and psychology . Accurate measurement of semantic similarity between words is essential for various tasks such as document clustering , information retrieval , and synonym extraction . We propose a novel model of semantic similarity using the semantic relations that exist among words . Given two words , first , we represent the semantic relations that hold between those words using automatically extracted lexical pattern clusters . Next , the semantic similarity between the two words is computed using a Mahalanobis distance measure . We compare the proposed similarity measure against previously proposed semantic similarity measures on the Miller-Charles benchmark dataset and the WordSimilarity-353 collection . The proposed method outperforms all existing web-based semantic similarity measures , achieving a Pearson correlation coefficient of 0.867 on the Miller-Charles dataset . ##P08-1117 Extraction of Entailed Semantic Relations Through Syntax-Based Comma Resolution . This paper studies textual inference by investigating comma structures , which are highly frequent elements whose major role in the extraction of semantic relations has not been hitherto recognized . 
We introduce the problem of comma resolution , defined as understanding the role of commas and extracting the relations they imply . We show the importance of the problem using examples from Textual Entailment tasks , and present A Sentence Transformation Rule Learner -LRB- ASTRL -RRB- , a machine learning algorithm that uses a syntactic analysis of the sentence to learn sentence transformation rules that can then be used to extract relations . We have manually annotated a corpus identifying comma structures and relations they entail and experimented with both gold standard parses and parses created by a leading statistical parser , obtaining F-scores of 80.2 % and 70.4 % respectively . ##W98-0719 Lexical Discovery With An Enriched Semantic Network . The study of lexical semantics has produced a systematic analysis of binary relationships between content words that has greatly benefited lexical search tools and natural language processing algorithms . We first introduce a database system called FreeNet that facilitates the description and exploration of finite binary relations . We then describe the design and implementation of Lexical FreeNet , a semantic network that mixes WordNet-derived semantic relations with data-derived and phonetically-derived relations . We discuss how Lexical FreeNet has aided in lexical discovery , the pursuit of linguistic and factual knowledge by the computer-aided exploration of lexical relations . ##I08-1008 Name Origin Recognition Using Maximum Entropy Model and Diverse Features . Name origin recognition is to identify the source language of a personal or location name . Some early work used either rule-based or statistical methods with a single knowledge source . In this paper , we cast the name origin recognition as a multi-class classification problem and approach the problem using the Maximum Entropy method . 
In doing so , we investigate the use of different features , including phonetic rules , ngram statistics and character position information for name origin recognition . Experiments on a publicly available personal name database show that the proposed approach achieves an overall accuracy of 98.44 % for names written in English and 98.10 % for names written in Chinese , which are significantly and consistently better than those in reported work . ##J86-2002 Summarizing Natural Language Database Responses . In a human dialogue it is usually considered inappropriate if one conversant monopolizes the conversation . Similarly it can be inappropriate for a natural language database interface to respond with a lengthy list of data . A non-enumerative `` summary '' response is less verbose and often avoids misleading the user where an extensional response might . In this paper we investigate the problem of generating such discourse-oriented concise responses . We present details of the design and implementation of a system that produces summary responses to queries of a relational data base . The system employs a set of heuristics that work in conjunction with a knowledge base to discover underlying regularities that form the basis of summary responses . The system is largely domain-independent , and hence can be ported relatively easily from one data base to another . It can handle a wide variety of situations requiring a summary response and can be readily extended . It also has a number of shortcomings which are discussed thoroughly and which form the basis for a number of suggested research directions . ##P06-1008 Acceptability Prediction By Means Of Grammaticality Quantification . We propose in this paper a method for quantifying sentence grammaticality . The approach based on Property Grammars , a constraint-based syntactic formalism , makes it possible to evaluate a grammaticality index for any kind of sentence , including ill-formed ones . 
We compare on a sample of sentences the grammaticality indices obtained from PG formalism and the acceptability judgements measured by means of a psycholinguistic analysis . The results show that the derived grammaticality index is a fairly good tracer of acceptability scores . ##W00-1427 Robust , Applied Morphological Generation . In practical natural language generation systems it is often advantageous to have a separate component that deals purely with morphological processing . We present such a component : a fast and robust morphological generator for English based on finite-state techniques that generates a word form given a specification of the lemma , part-of-speech , and the type of inflection required . We describe how this morphological generator is used in a prototype system for automatic simplification of English newspaper text , and discuss practical morphological and orthographic issues we have encountered in generation of unrestricted text within this application . ##W09-1216 Multilingual Semantic Parsing with a Pipeline of Linear Classifiers . I describe a fast multilingual parser for semantic dependencies . The parser is implemented as a pipeline of linear classifiers trained with support vector machines . I use only first order features , and no pair-wise feature combinations in order to reduce training and prediction times . Hyper-parameters are carefully tuned for each language and sub-problem . The system is evaluated on seven different languages : Catalan , Chinese , Czech , English , German , Japanese and Spanish . An analysis of learning rates and of the reliance on syntactic parsing quality shows that only modest improvements could be expected for most languages given more training data ; Better syntactic parsing quality , on the other hand , could greatly improve the results . Individual tuning of hyper-parameters is crucial for obtaining good semantic parsing quality . ##W05-0902 On The Subjectivity Of Human Authored Summaries . 
We address the issue of human subjectivity when authoring summaries , aiming at a simple , robust evaluation of machine generated summaries . Applying a cross comprehension test on human authored short summaries from broadcast news , the level of subjectivity is gauged among four authors . The instruction set is simple , thus there is enough room for subjectivity . However the approach is robust because the test does not use the absolute score , relying instead on relative comparison , effectively alleviating the subjectivity . Finally we illustrate the application of the above scheme when evaluating the informativeness of machine generated summaries . ##W06-3501 Pragmatic Information Extraction From Subject Ellipsis In Informal English . Subject ellipsis is one of the characteristics of informal English . The investigation of subject ellipsis in corpora thus reveals an abundance of pragmatic and extralinguistic information associated with subject ellipsis that enhances natural language understanding . In essence , the presence of subject ellipsis conveys an ` informal ' conversation involving 1 -RRB- an informal ` Topic ' as well as familiar\/close ` Participants ' , 2 -RRB- specific ` Connotations ' that are different from the corresponding full sentences : interruptive -LRB- ending discourse coherence -RRB- , polite , intimate , friendly , and less determinate implicatures . This paper also construes linguistic environments that trigger the use of subject ellipsis and resolve subject ellipsis . ##C02-1081 Data-Driven Classification Of Linguistic Styles In Spoken Dialogues . Language users have individual linguistic styles . A spoken dialogue system may benefit from adapting to the linguistic style of a user in input analysis and output generation . To investigate the possibility to automatically classify speakers according to their linguistic style three corpora of spoken dialogues were analyzed . Several numerical parameters were computed for every speaker . 
These parameters were reduced to linguistically interpretable components by means of a principal component analysis . Classes were established from these components by cluster analysis . Unseen input was classified by trained neural networks with varying error rates depending on corpus type . A first investigation in using special language models for speaker classes was carried out . ##I05-1048 A Lexicon-Constrained Character Model for Chinese Morphological Analysis . This paper proposes a lexicon-constrained character model that combines both word and character features to solve complicated issues in Chinese morphological analysis . A Chinese character-based model constrained by a lexicon is built to acquire word building rules . Each character in a Chinese sentence is assigned a tag by the proposed model . The word segmentation and part-of-speech tagging results are then generated based on the character tags . The proposed method solves such problems as unknown word identification , data sparseness , and estimation bias in an integrated , unified framework . Preliminary experiments indicate that the proposed method outperforms the best SIGHAN word segmentation systems in the open track on 3 out of the 4 test corpora . Additionally , our method can be conveniently integrated with any other Chinese morphological systems as a post-processing module leading to significant improvement in performance . ##P09-2039 Extracting Comparative Sentences from Korean Text Documents Using Comparative Lexical Patterns and Machine Learning Techniques . This paper proposes how to automatically identify Korean comparative sentences from text documents . This paper first investigates many comparative sentences referring to previous studies and then defines a set of comparative keywords from them . A sentence which contains one or more elements of the keyword set is called a comparative-sentence candidate . 
Finally , we use machine learning techniques to eliminate non-comparative sentences from the candidates . As a result , we achieved significant performance , an F1-score of 88.54 % , in our experiments using various web documents . ##P08-2028 The Good , the Bad , and the Unknown : Morphosyllabic Sentiment Tagging of Unseen Words . The omnipresence of unknown words is a problem that any NLP component needs to address in some form . While there exist many established techniques for dealing with unknown words in the realm of POS-tagging , for example , guessing unknown words ' semantic properties is a less-explored area with greater challenges . In this paper , we study the semantic field of sentiment and propose five methods for assigning prior sentiment polarities to unknown words based on known sentiment carriers . Tested on 2000 cases , the methods mirror human judgements closely in three- and two-way polarity classification tasks , and reach accuracies above 63 % and 81 % , respectively . ##N06-1004 Segment Choice Models : Feature-Rich Models For Global Distortion In Statistical Machine Translation . This paper presents a new approach to distortion -LRB- phrase reordering -RRB- in phrase-based machine translation -LRB- MT -RRB- . Distortion is modeled as a sequence of choices during translation . The approach yields trainable , probabilistic distortion models that are global : they assign a probability to each possible phrase reordering . These `` segment choice '' models -LRB- SCMs -RRB- can be trained on `` segment-aligned '' sentence pairs ; they can be applied during decoding or rescoring . The approach yields a metric called `` distortion perplexity '' -LRB- `` disperp '' -RRB- for comparing SCMs offline on test data , analogous to perplexity for language models . A decision-tree-based SCM is tested on Chinese-to-English translation , and outperforms a baseline distortion penalty approach at the 99 % confidence level . 
##P06-1004 Minimum Cut Model For Spoken Lecture Segmentation . We consider the task of unsupervised lecture segmentation . We formalize segmentation as a graph-partitioning task that optimizes the normalized cut criterion . Our approach moves beyond localized comparisons and takes into account long-range cohesion dependencies . Our results demonstrate that global analysis improves the segmentation accuracy and is robust in the presence of speech recognition errors . ##W99-0906 A Computational Approach To Deciphering Unknown Scripts . We propose and evaluate computational techniques for deciphering unknown scripts . We focus on the case in which an unfamiliar script encodes a known language . The decipherment of a brief document or inscription is driven by data about the spoken language . We consider which scripts are easy or hard to decipher , how much data is required , and whether the techniques are robust against language change over time . ##W08-1511 A Small-Vocabulary Shared Task for Medical Speech Translation . We outline a possible small-vocabulary shared task for the emerging medical speech translation community . Data would consist of about 2000 recorded and transcribed utterances collected during an evaluation of an English ↔ Spanish version of the Open Source MedSLT system ; the vocabulary covered consisted of about 450 words in English , and 250 in Spanish . The key problem in defining the task is to agree on a scoring system which is acceptable both to medical professionals and to the speech and language community . We suggest a framework for defining and administering a scoring system of this kind . ##D08-1058 Using Bilingual Knowledge and Ensemble Techniques for Unsupervised Chinese Sentiment Analysis . It is a challenging task to identify sentiment polarity of Chinese reviews because the resources for Chinese sentiment analysis are limited . 
Instead of leveraging only monolingual Chinese knowledge , this study proposes a novel approach to leverage reliable English resources to improve Chinese sentiment analysis . Rather than simply projecting English resources onto Chinese resources , our approach first translates Chinese reviews into English reviews by machine translation services , and then identifies the sentiment polarity of English reviews by directly leveraging English resources . Furthermore , our approach performs sentiment analysis for both Chinese reviews and English reviews , and then uses ensemble methods to combine the individual analysis results . Experimental results on a dataset of 886 Chinese product reviews demonstrate the effectiveness of the proposed approach . The individual analysis of the translated English reviews outperforms the individual analysis of the original Chinese reviews , and the combination of the individual analysis results further improves the performance . ##P08-2011 Coreference-inspired Coherence Modeling . Research on coreference resolution and summarization has modeled the way entities are realized as concrete phrases in discourse . In particular there exist models of the noun phrase syntax used for discourse-new versus discourse-old referents , and models describing the likely distance between a pronoun and its antecedent . However , models of discourse coherence , as applied to information ordering tasks , have ignored these kinds of information . We apply a discourse-new classifier and pronoun coreference algorithm to the information ordering task , and show significant improvements in performance over the entity grid , a popular model of local coherence . ##P98-1021 Spoken Dialogue Interpretation with the DOP Model . We show how the DOP model can be used for fast and robust processing of spoken input in a practical spoken dialogue system called OVIS . 
OVIS , Openbaar Vervoer Informatie Systeem -LRB- `` Public Transport Information System '' -RRB- , is a Dutch spoken language information system which operates over ordinary telephone lines . The prototype system is the immediate goal of the NWO Priority Programme `` Language and Speech Technology '' . In this paper , we extend the original DOP model to context-sensitive interpretation of spoken input . The system we describe uses the OVIS corpus -LRB- 10,000 trees enriched with compositional semantics -RRB- to compute from an input word-graph the best utterance together with its meaning . Dialogue context is taken into account by dividing up the OVIS corpus into context-dependent subcorpora . Each system question triggers a subcorpus by which the user answer is analyzed and interpreted . Our experiments indicate that the context-sensitive DOP model obtains better accuracy than the original model , allowing for fast and robust processing of spoken input . ##N06-2018 MMR-Based Active Machine Learning For Bio Named Entity Recognition . This paper presents a new active learning paradigm which considers not only the uncertainty of the classifier but also the diversity of the corpus . The two measures for uncertainty and diversity were combined using the MMR -LRB- Maximal Marginal Relevance -RRB- method to give the sampling scores in our active learning strategy . We incorporated the MMR-based active machine-learning idea into the biomedical named-entity recognition system . Our experimental results indicated that our strategies for active-learning based sample selection could significantly reduce the human effort . ##P09-3007 Accurate Learning for Chinese Function Tags from Minimal Features . Data-driven function tag assignment has been studied for English using Penn Treebank data . In this paper , we address the question of whether such a method can be applied to other languages and Treebank resources . 
In addition to simply extending the previous method from English to Chinese , we also proposed an effective way to recognize function tags directly from lexical information , which is easily scalable for languages that lack sufficient parsing resources or have inherent linguistic challenges for parsing . We investigated a supervised sequence learning method to automatically recognize function tags , which achieves an F-score of 0.938 on gold-standard POS -LRB- Part-of-Speech -RRB- tagged Chinese text -- a statistically significant improvement over existing Chinese function label assignment systems . Results show that a small number of linguistically motivated lexical features are sufficient to achieve comparable performance to systems using sophisticated parse trees . ##P97-1021 A DOP Model For Semantic Interpretation . In data-oriented language processing , an annotated language corpus is used as a stochastic grammar . The most probable analysis of a new sentence is constructed by combining fragments from the corpus in the most probable way . This approach has been successfully used for syntactic analysis , using corpora with syntactic annotations such as the Penn Tree-bank . If a corpus with semantically annotated sentences is used , the same approach can also generate the most probable semantic interpretation of an input sentence . The present paper explains this semantic interpretation method . A data-oriented semantic interpretation algorithm was tested on two semantically annotated corpora : the English ATIS corpus and the Dutch OVIS corpus . Experiments show an increase in semantic accuracy if larger corpus-fragments are taken into consideration . ##W06-1207 Classifying Particle Semantics In English Verb-Particle Constructions . 
Previous computational work on learning the semantic properties of verb-particle constructions -LRB- VPCs -RRB- has focused on their compositionality , and has left unaddressed the issue of which meaning of the component words is being used in a given VPC . We develop a feature space for use in classification of the sense contributed by the particle in a VPC , and test this on VPCs using the particle up . The features that capture linguistic properties of VPCs that are relevant to the semantics of the particle outperform linguistically uninformed word co-occurrence features in our experiments on unseen test VPCs . ##C04-1075 A High-Performance Coreference Resolution System Using A Constraint-Based Multi-Agent Strategy . This paper presents a constraint-based multi-agent strategy to coreference resolution of general noun phrases in unrestricted English text . For a given anaphor and all the preceding referring expressions as the antecedent candidates , a common constraint agent is first presented to filter out invalid antecedent candidates using various kinds of general knowledge . Then , according to the type of the anaphor , a special constraint agent is proposed to filter out more invalid antecedent candidates using constraints which are derived from various kinds of special knowledge . Finally , a simple preference agent is used to choose an antecedent for the anaphor from the remaining antecedent candidates , based on the proximity principle . One interesting observation is that the most recent antecedent of an anaphor in the coreferential chain is sometimes indirectly linked to the anaphor via some other antecedents in the chain . In this case , we find that the most recent antecedent always contains little information to directly determine the coreference relationship with the anaphor . Therefore , for a given anaphor , the corresponding special constraint agent can always safely filter out these less informative antecedent candidates . 
In this way , rather than finding the most recent antecedent for an anaphor , our system tries to find the most direct and informative antecedent . Evaluation shows that our system achieves Precision \/ Recall \/ F-measures of 84.7 % \/ 65.8 % \/ 73.9 and 82.8 % \/ 55.7 % \/ 66.5 on MUC-6 and MUC-7 English coreference tasks respectively . This means that our system achieves significantly better precision rates by about 8 percent over the best-reported systems while keeping recall rates . ##P05-2011 Towards An Optimal Lexicalization In A Natural-Sounding Portable Natural Language Generator For Dialog Systems . In contrast to the latest progress in speech recognition , the state-of-the-art in natural language generation for spoken language dialog systems is lagging behind . The core dialog managers are now more sophisticated ; and natural-sounding and flexible output is expected , but not achieved with current simple techniques such as template-based systems . Portability of systems across subject domains and languages is another increasingly important requirement in dialog systems . This paper presents an outline of LEGEND , a system that is both portable and generates natural-sounding output . This goal is achieved through the novel use of existing lexical resources such as FrameNet and WordNet . ##W09-2402 Making Sense of Word Sense Variation . We present a pilot study of word-sense annotation using multiple annotators , relatively polysemous words , and a heterogenous corpus . Annotators selected senses for words in context , using an annotation interface that presented WordNet senses . Inter-annotator agreement -LRB- IA -RRB- results show that annotators agree well or not , depending primarily on the individual words and their general usage properties . Our focus is on identifying systematic differences across words and annotators that can account for IA variation . 
We identify three lexical use factors : semantic specificity of the context , sense concreteness , and similarity of senses . We discuss systematic differences in sense selection across annotators , and present the use of association rules to mine the data for systematic differences across annotators . ##P09-2047 Query Segmentation Based on Eigenspace Similarity . Query segmentation is essential to query processing . It aims to tokenize query words into several semantic segments and help the search engine to improve the precision of retrieval . In this paper , we present a novel unsupervised learning approach to query segmentation based on principal eigenspace similarity of the query-word-frequency matrix derived from web statistics . Experimental results show that our approach could achieve superior performance of 35.8 % and 17.7 % in F-measure over the two baselines respectively , i.e. MI -LRB- Mutual Information -RRB- approach and EM optimization approach . ##P98-1035 Exploiting Syntactic Structure for Language Modeling . The paper presents a language model that develops syntactic structure and uses it to extract meaningful information from the word history , thus enabling the use of long distance dependencies . The model assigns probability to every joint sequence of words-binary-parse-structure with headword annotation and operates in a left-to-right manner - therefore usable for automatic speech recognition . The model , its probabilistic parameterization , and a set of experiments meant to evaluate its predictive power are presented ; an improvement over standard trigram modeling is achieved . ##C90-3072 Spelling-Checking For Highly Inflective Languages . Spelling-checkers have become an integral part of most text processing software . For different reasons , among which the speed of processing prevails , they are usually based on dictionaries of word forms instead of words . 
This approach is sufficient for languages with little inflection such as English , but fails for highly inflective languages such as Czech , Russian , Slovak or other Slavonic languages . We have developed a special method for describing inflection for the purpose of building spelling-checkers for such languages . The speed of the resulting program lies somewhere in the middle of the scale of existing spelling-checkers for English and the main dictionary fits into the standard 360K floppy , whereas the number of recognized word forms exceeds 6 million -LRB- for Czech -RRB- . Further , a special method has been developed for easy word classification . ##P92-1015 Prosodic Aids To Syntactic And Semantic Analysis Of Spoken English . Prosody can be useful in resolving certain lexical and structural ambiguities in spoken English . In this paper we present some results of employing two types of prosodic information , namely pitch and pause , to assist syntactic and semantic analysis during parsing . ##W99-0904 Unsupervised Learning Of Derivational Morphology From Inflectional Lexicons . We present in this paper an unsupervised method to learn suffixes and suffixation operations from an inflectional lexicon of a language . The elements acquired with our method are used to build stemming procedures and can assist lexicographers in the development of new lexical resources . ##W01-0908 Using The Distribution Of Performance For Studying Statistical NLP Systems And Corpora . Statistical NLP systems are frequently evaluated and compared on the basis of their performances on a single split of training and test data . Results obtained using a single split are , however , subject to sampling noise . In this paper we argue in favor of reporting a distribution of performance figures , obtained by resampling the training data , rather than a single number . 
The additional information from distributions can be used to make statistically quantified statements about differences across parameter settings , systems , and corpora . ##W03-0907 Story Understanding Through Multi-Representation Model Construction . We present an implemented model of story understanding and apply it to the understanding of a children 's story . We argue that understanding a story consists of building multirepresentation models of the story and that story models are efficiently constructed using a satisfiability solver . We present a computer program that contains multiple representations of commonsense knowledge , takes a narrative as input , transforms the narrative and representations of commonsense knowledge into a satisfiability problem , runs a satisfiability solver , and produces models of the story as output . The narrative , models , and representations are expressed in the language of Shanahan 's event calculus . ##P07-1083 Alignment-Based Discriminative String Similarity . A character-based measure of similarity is an important component of many natural language processing systems , including approaches to transliteration , coreference , word alignment , spelling correction , and the identification of cognates in related vocabularies . We propose an alignment-based discriminative framework for string similarity . We gather features from substring pairs consistent with a character-based alignment of the two strings . This approach achieves exceptional performance ; on nine separate cognate identification experiments using six language pairs , we more than double the precision of traditional orthographic measures like Longest Common Subsequence Ratio and Dice 's Coefficient . We also show strong improvements over other recent discriminative and heuristic similarity functions . ##W09-1113 Learning Where to Look : Modeling Eye Movements in Reading . 
We propose a novel machine learning task that consists in learning to predict which words in a text are fixated by a reader . In a first pilot experiment , we show that it is possible to outperform a majority baseline using a transition-based model with a logistic regression classifier and a very limited set of features . We also show that the model is capable of capturing frequency effects on eye movements observed in human readers . ##P04-3025 Incorporating Topic Information Into Semantic Analysis Models . This paper reports experiments in classifying texts based upon their favorability towards the subject of the text using a feature set enriched with topic information on a small dataset of music reviews hand-annotated for topic . The results of these experiments suggest ways in which incorporating topic information into such models may yield improvement over models which do not use topic information . ##D08-1026 Incorporating Temporal and Semantic Information with Eye Gaze for Automatic Word Acquisition in Multimodal Conversational Systems . One major bottleneck in conversational systems is their incapability in interpreting unexpected user language inputs such as out-of-vocabulary words . To overcome this problem , conversational systems must be able to learn new words automatically during human machine conversation . Motivated by psycholinguistic findings on eye gaze and human language processing , we are developing techniques to incorporate human eye gaze for automatic word acquisition in multimodal conversational systems . This paper investigates the use of temporal alignment between speech and eye gaze and the use of domain knowledge in word acquisition . Our experiment results indicate that eye gaze provides a potential channel for automatically acquiring new words . The use of extra temporal and domain knowledge can significantly improve acquisition performance . ##W04-2314 Bootstrapping Spoken Dialog Systems With Data Reuse . 
Building natural language spoken dialog systems requires large amounts of human transcribed and labeled speech utterances to reach useful operational service performances . Furthermore , the design of such complex systems consists of several manual steps . The User Experience -LRB- UE -RRB- expert analyzes and defines by hand the system core functionalities : the system semantic scope -LRB- call-types -RRB- and the dialog manager strategy which will drive the human-machine interaction . This approach is extensive and error prone since it involves several non-trivial design decisions that can only be evaluated after the actual system deployment . Moreover , scalability is compromised by time , costs and the high level of UE know-how needed to reach a consistent design . We propose a novel approach for bootstrapping spoken dialog systems based on reuse of existing transcribed and labeled data , common reusable dialog templates and patterns , generic language and understanding models , and a consistent design process . We demonstrate that our approach reduces design and development time while providing an effective system without any application specific data . ##W97-0213 A Perspective On Word Sense Disambiguation Methods And Their Evaluation . In this position paper , we make several observations about the state of the art in automatic word sense disambiguation . Motivated by these observations , we offer several specific proposals to the community regarding improved evaluation criteria , common training and testing resources , and the definition of sense inventories . ##P06-1009 Discriminative Word Alignment With Conditional Random Fields . In this paper we present a novel approach for inducing word alignments from sentence aligned data . We use a Conditional Random Field -LRB- CRF -RRB- , a discriminative model , which is estimated on a small supervised training set . 
The CRF is conditioned on both the source and target texts , and thus allows for the use of arbitrary and overlapping features over these data . Moreover , the CRF has efficient training and decoding processes which both find globally optimal solutions . We apply this alignment model to both French-English and Romanian-English language pairs . We show how a large number of highly predictive features can be easily incorporated into the CRF , and demonstrate that even with only a few hundred word-aligned training sentences , our model improves over the current state-of-the-art with alignment error rates of 5.29 and 25.8 for the two tasks respectively . ##W00-1214 Machine Learning Methods For Chinese Web Page Categorization . This paper reports our evaluation of k Nearest Neighbor -LRB- kNN -RRB- , Support Vector Machines -LRB- SVM -RRB- , and Adaptive Resonance Associative Map -LRB- ARAM -RRB- on Chinese web page classification . Benchmark experiments based on a Chinese web corpus showed that their predictive performance was roughly comparable although ARAM and kNN slightly outperformed SVM in small categories . In addition , inserting rules into ARAM helped to improve performance , especially for small well-defined categories . ##I08-4025 Training a Perceptron with Global and Local Features for Chinese Word Segmentation . This paper proposes the use of global features for Chinese word segmentation . These global features are combined with local features using the averaged perceptron algorithm over N-best candidate word segmentations . The N-best candidates are produced using a conditional random field -LRB- CRF -RRB- character-based tagger for word segmentation . Our experiments show that by adding global features , performance is significantly improved compared to the character-based CRF tagger . Performance is also improved compared to using only local features . 
Our system obtains an F-score of 0.9355 on the CityU corpus , 0.9263 on the CKIP corpus , 0.9512 on the SXU corpus , 0.9296 on the NCC corpus and 0.9501 on the CTB corpus . All results are for the closed track in the fourth SIGHAN Chinese Word Segmentation Bakeoff . ##N07-4014 The Hidden Information State Dialogue Manager : A Real-World POMDP-Based System . The Hidden Information State -LRB- HIS -RRB- Dialogue System is the first trainable and scalable implementation of a spoken dialog system based on the Partially Observable Markov Decision Process -LRB- POMDP -RRB- model of dialogue . The system responds to n-best output from the speech recogniser , maintains multiple concurrent dialogue state hypotheses , and provides a visual display showing how competing hypotheses are ranked . The demo is a prototype application for the Tourist Information Domain and achieved a task completion rate of over 90 % in a recent user study . ##P09-1069 Learning a Compositional Semantic Parser using an Existing Syntactic Parser . We present a new approach to learning a semantic parser -LRB- a system that maps natural language sentences into logical form -RRB- . Unlike previous methods , it exploits an existing syntactic parser to produce disambiguated parse trees that drive the compositional semantic interpretation . The resulting system produces improved results on standard corpora on natural language interfaces for database querying and simulated robot control . ##N09-1028 Using a Dependency Parser to Improve SMT for Subject-Object-Verb Languages . We introduce a novel precedence reordering approach based on a dependency parser to statistical machine translation systems . Similar to other preprocessing reordering approaches , our method can efficiently incorporate linguistic knowledge into SMT systems without increasing the complexity of decoding . 
For a set of five subject-object-verb -LRB- SOV -RRB- order languages , we show significant improvements in BLEU scores when translating from English , compared to other reordering approaches , in state-of-the-art phrase-based SMT systems . ##J89-2001 A Pragmatics-Based Approach To Ellipsis Resolution . Intersentential elliptical utterances occur frequently during information-seeking dialogs in task domains . This paper presents a pragmatics-based framework for interpreting such utterances . Discourse expectations and focusing heuristics are used to facilitate recognition of an information-seeker 's intent in uttering an elliptical fragment . The ellipsis is comprehended by identifying both the aspect of the information-seeker 's task-related plan highlighted by the fragment and the conversational discourse goal fulfilled by the utterance . The contribution of this approach is its consideration of pragmatic information , including discourse content and conversational goals , rather than just the precise representation of the preceding utterance . ##C88-1021 Anaphora Resolution : A Multi-Strategy Approach . Anaphora resolution has proven to be a very difficult problem ; it requires the integrated application of syntactic , semantic , and pragmatic knowledge . This paper examines the hypothesis that instead of attempting to construct a monolithic method for resolving anaphora , the combination of multiple strategies , each exploiting a different knowledge source , proves more effective , theoretically and computationally . Cognitive plausibility is established in that human judgements of the optimal anaphoric referent accord with those of the strategy-based method , and human inability to determine a unique referent corresponds to the cases where different strategies offer conflicting candidates for the anaphoric referent . ##N09-2044 Classifying Factored Genres with Part-of-Speech Histograms . 
This work addresses the problem of genre classification of text and speech transcripts , with the goal of handling genres not seen in training . Two frameworks employing different statistics on word\/POS histograms with a PCA transform are examined : a single model for each genre and a factored representation of genre . The impact of the two frameworks on the classification of training-matched and new genres is discussed . Results show that the factored models allow for a finer-grained representation of genre and can more accurately characterize genres not seen in training . ##W08-0616 Using Natural Language Processing to Classify Suicide Notes . We hypothesize that machine-learning algorithms -LRB- MLA -RRB- can classify completer and simulated suicide notes as well as mental health professionals -LRB- MHP -RRB- . Five MHPs classified 66 simulated or completer notes ; MLAs were used for the same task . Results : MHPs were accurate 71 % of the time ; using the sequential minimal optimization algorithm -LRB- SMO -RRB- MLAs were accurate 78 % of the time . There was no significant difference between the MLA and MHP classifiers . This is an important first step in developing an evidence-based suicide predictor for emergency department use . ##W02-0309 Biomedical Text Retrieval In Languages With A Complex Morphology . Document retrieval in languages with a rich and complex morphology -- particularly in terms of derivation and -LRB- single-word -RRB- composition -- suffers from serious performance degradation with the stemming-only query-term-to-text-word matching paradigm . We propose an alternative approach in which morphologically complex word forms are segmented into relevant subwords -LRB- such as stems , named entities , acronyms -RRB- , and subwords constitute the basic unit for indexing and retrieval . We evaluate our approach on a large biomedical document collection . ##W05-1302 Adaptive String Similarity Metrics For Biomedical Reference Resolution . 
In this paper we present the evaluation of a set of string similarity metrics used to resolve the mapping from strings to concepts in the UMLS MetaThesaurus . String similarity is conceived as a single component in a full Reference Resolution System that would resolve such a mapping . Given this qualification , we obtain positive results achieving 73.6 F-measure -LRB- 76.1 precision and 71.4 recall -RRB- for the task of assigning the correct UMLS concept to a given string . Our results demonstrate that adaptive string similarity methods based on Conditional Random Fields outperform standard metrics in this domain . ##H05-1058 Part-Of-Speech Tagging Using Virtual Evidence And Negative Training . We present a part-of-speech tagger which introduces two new concepts : virtual evidence in the form of an observed child node , and negative training data to learn the conditional probabilities for the observed child . Associated with each word is a flexible feature-set which can include binary flags , neighboring words , etc. . The conditional probability of Tag given Word + Features is implemented using a factored language-model with back-off to avoid data sparsity problems . This model remains within the framework of Dynamic Bayesian Networks -LRB- DBNs -RRB- and is conditionally-structured , but resolves the label bias problem inherent in the conditional Markov model -LRB- CMM -RRB- . ##W04-3013 Context Sensing Using Speech And Common Sense . We present a method of inferring aspects of a person 's context by capturing conversation topics and using prior knowledge of human behavior . This paper claims that topic-spotting performance can be improved by using a large database of common sense knowledge . We describe two systems we built to infer context from noisy transcriptions of spoken conversations using common sense , and detail some preliminary results . 
The GISTER system uses OMCSNet , a commonsense semantic network , to infer the most likely topics under discussion in a conversation stream . The OVERHEAR system is built on top of GISTER , and distinguishes between aspects of the conversation that refer to past , present , and future events by using LifeNet , a probabilistic graphical model of human behavior , to help infer the events that occurred in each of those three time periods . We conclude by discussing some of the future directions we may take this work . ##P04-3003 Constructing Transliteration Lexicons From Web Corpora . This paper proposes a novel approach to automating the construction of transliterated-term lexicons . A simple syllable alignment algorithm is used to construct confusion matrices for cross-language syllable-phoneme conversion . Each row in the confusion matrix consists of a set of syllables in the source language that are -LRB- correctly or erroneously -RRB- matched phonetically and statistically to a syllable in the target language . Two conversions using phoneme-to-phoneme and text-to-phoneme syllabification algorithms are automatically deduced from a training corpus of paired terms and are used to calculate the degree of similarity between phonemes for transliterated-term extraction . In a large-scale experiment using this automated learning process for conversions , more than 200,000 transliterated-term pairs were successfully extracted by analyzing query results from Internet search engines . Experimental results indicate the proposed approach shows promise in transliterated-term extraction . ##N07-2039 Reversible Sound-to-Letter\/Letter-to-Sound Modeling Based on Syllable Structure . This paper describes a new grapheme-to-phoneme framework , based on a combination of formal linguistic and statistical methods . 
A context-free grammar is used to parse words into their underlying syllable structure , and a set of subword `` spellneme '' units encoding both phonemic and graphemic information can be automatically derived from the parsed words . A statistical n-gram model can then be trained on a large lexicon of words represented in terms of these linguistically motivated subword units . The framework has potential applications in modeling unknown words and in linking spoken spellings with spoken pronunciations for fully automatic new-word acquisition via dialogue interaction . Results are reported on sound-to-letter experiments for the nouns in the Phonebook corpus . ##W98-1117 A Maximum-Entropy Partial Parser For Unrestricted Text . This paper describes a partial parser that assigns syntactic structures to sequences of part-of-speech tags . The program uses the maximum entropy parameter estimation method , which allows a flexible combination of different knowledge sources : the hierarchical structure , parts of speech and phrasal categories . In effect , the parser goes beyond simple bracketing and recognizes even fairly complex structures . We give accuracy figures for different applications of the parser . ##C92-4199 Recognizing Unregistered Names For Mandarin Word Identification . Word Identification has been an important and active issue in Chinese Natural Language Processing . In this paper , a new mechanism , based on the concept of sublanguage , is proposed for identifying unknown words , especially personal names , in Chinese newspapers . The proposed mechanism includes title-driven name recognition , adaptive dynamic word formation , and identification of 2-character and 3-character Chinese names without title . We will show the experimental results for two corpora and compare them with the results of the NTHU statistic-based system , the only system that we know has attacked the same problem . 
The experimental results have shown significant improvements over the WI systems without the name identification capability . ##P96-1020 Pattern-Based Context-Free Grammars For Machine Translation . This paper proposes the use of `` pattern-based '' context-free grammars as a basis for building machine translation -LRB- MT -RRB- systems , which are now being adopted as personal tools by a broad range of users in the cyberspace society . We discuss major requirements for such tools , including easy customization for diverse domains , the efficiency of the translation algorithm , and scalability -LRB- incremental improvement in translation quality through user interaction -RRB- , and describe how our approach meets these requirements . ##N09-3005 Using Language Modeling to Select Useful Annotation Data . An annotation project typically has an abundant supply of unlabeled data that can be drawn from some corpus , but because the labeling process is expensive , it is helpful to pre-screen the pool of the candidate instances based on some criterion of future usefulness . In many cases , that criterion is to improve the presence of the rare classes in the data to be annotated . We propose a novel method for solving this problem and show that it compares favorably to a random sampling baseline and a clustering algorithm . ##P07-3016 Clustering Hungarian Verbs on the Basis of Complementation Patterns . Our paper reports an attempt to apply an unsupervised clustering algorithm to a Hungarian treebank in order to obtain semantic verb classes . Starting from the hypothesis that semantic metapredicates underlie verbs ' syntactic realization , we investigate how one can obtain semantically motivated verb classes by automatic means . The 150 most frequent Hungarian verbs were clustered on the basis of their complementation patterns , yielding a set of basic classes and hints about the features that determine verbal subcategorization . 
The resulting classes serve as a basis for the subsequent analysis of their alternation behavior . ##I05-5003 Using Machine Translation Evaluation Techniques to Determine Sentence-level Semantic Equivalence . The task of machine translation -LRB- MT -RRB- evaluation is closely related to the task of sentence-level semantic equivalence classification . This paper investigates the utility of applying standard MT evaluation methods -LRB- BLEU , NIST , WER and PER -RRB- to building classifiers to predict semantic equivalence and entailment . We also introduce a novel classification method based on PER which leverages part of speech information of the words contributing to the word matches and non-matches in the sentence . Our results show that MT evaluation techniques are able to produce useful features for paraphrase classification and to a lesser extent entailment . Our technique gives a substantial improvement in paraphrase classification accuracy over all of the other models used in the experiments . ##C04-1022 Automatic Learning Of Language Model Structure . Statistical language modeling remains a challenging task , in particular for morphologically rich languages . Recently , new approaches based on factored language models have been developed to address this problem . These models provide principled ways of including additional conditioning variables other than the preceding words , such as morphological or syntactic features . However , the number of possible choices for model parameters creates a large space of models that can not be searched exhaustively . This paper presents an entirely data-driven model selection procedure based on genetic search , which is shown to outperform both knowledge-based and random selection procedures on two different language modeling tasks -LRB- Arabic and Turkish -RRB- . ##N09-2016 Learning Bayesian Networks for Semantic Frame Composition in a Spoken Dialog System . 
A stochastic approach based on Dynamic Bayesian Networks -LRB- DBNs -RRB- is introduced for spoken language understanding . DBN-based models allow one to infer and then to compose semantic frame-based tree structures from speech transcriptions . Experimental results on the French MEDIA dialog corpus show the appropriateness of the technique , which both leads to good tree identification results and can provide the dialog system with n-best lists of scored hypotheses . ##D08-1008 Dependency-based Semantic Role Labeling of PropBank . We present a PropBank semantic role labeling system for English that is integrated with a dependency parser . To tackle the problem of joint syntactic -- semantic analysis , the system relies on a syntactic and a semantic subcomponent . The syntactic model is a projective parser using pseudo-projective transformations , and the semantic model uses global inference mechanisms on top of a pipeline of classifiers . The complete syntactic -- semantic output is selected from a candidate pool generated by the subsystems . We evaluate the system on the CoNLL-2005 test sets using segment-based and dependency-based metrics . Using the segment-based CoNLL-2005 metric , our system achieves a near state-of-the-art F1 figure of 77.97 on the WSJ+Brown test set , or 78.84 if punctuation is treated consistently . Using a dependency-based metric , the F1 figure of our system is 84.29 on the test set from CoNLL-2008 . Our system is the first dependency-based semantic role labeler for PropBank that rivals constituent-based systems in terms of performance . ##J98-4003 Machine Transliteration . It is challenging to translate names and technical terms across languages with different alphabets and sound inventories . These items are commonly transliterated , i.e. , replaced with approximate phonetic equivalents . For example , `` computer '' in English comes out as `` konpyuutaa '' in Japanese . 
Translating such items from Japanese back to English is even more challenging , and of practical interest , as transliterated items make up the bulk of text phrases not found in bilingual dictionaries . We describe and evaluate a method for performing backwards transliterations by machine . This method uses a generative model , incorporating several distinct stages in the transliteration process . ##W09-0439 Stabilizing Minimum Error Rate Training . The most commonly used method for training feature weights in statistical machine translation -LRB- SMT -RRB- systems is Och 's minimum error rate training -LRB- MERT -RRB- procedure . A well-known problem with Och 's procedure is that it tends to be sensitive to small changes in the system , particularly when the number of features is large . In this paper , we quantify the stability of Och 's procedure by supplying different random seeds to a core component of the procedure -LRB- Powell 's algorithm -RRB- . We show that for systems with many features , there is extensive variation in outcomes , both on the development data and on the test data . We analyze the causes of this variation and propose modifications to the MERT procedure that improve stability while helping performance on test data . ##I05-1001 A New Method for Sentiment Classification in Text Retrieval . Traditional text categorization is usually a topic-based task , but a subtle demand on information retrieval is to distinguish between positive and negative view on text topic . In this paper , a new method is explored to solve this problem . Firstly , a batch of Concerned Concepts in the researched domain is predefined . Secondly , the special knowledge representing the positive or negative context of these concepts within sentences is built up . At last , an evaluating function based on the knowledge is defined for sentiment classification of free text . We introduce some linguistic knowledge in these procedures to make our method effective . 
As a result , the new method proves better compared with SVM when experimenting on Chinese texts about a certain topic . ##W06-3811 Synonym Extraction Using A Semantic Distance On A Dictionary . Synonym extraction is a difficult task to achieve and evaluate . Some studies have tried to exploit general dictionaries for that purpose , seeing them as graphs where words are related by the definition they appear in , in a complex network of an arguably semantic nature . The advantage of using a general dictionary lies in the coverage , and the availability of such resources , in general and also in specialised domains . We present here a method exploiting such a graph structure to compute a distance between words . This distance is used to isolate candidate synonyms for a given word . We present an evaluation of the relevance of the candidates on a sample of the lexicon . ##C94-1027 Part-Of-Speech Tagging With Neural Networks . Text corpora which are tagged with part-of-speech information are useful in many areas of linguistic research . In this paper , a new part-of-speech tagging method based on neural networks -LRB- Net-Tagger -RRB- is presented and its performance is compared to that of an HMM-tagger -LRB- Cutting et al. , 1992 -RRB- and a trigram-based tagger -LRB- Kempe , 1993 -RRB- . It is shown that the Net-Tagger performs as well as the trigram-based tagger and better than the HMM-tagger . ##P05-1046 Unsupervised Learning Of Field Segmentation Models For Information Extraction . The applicability of many current information extraction techniques is severely limited by the need for supervised training data . We demonstrate that for certain field structured extraction tasks , such as classified advertisements and bibliographic citations , small amounts of prior knowledge can be used to learn effective models in a primarily unsupervised fashion . 
Although hidden Markov models -LRB- HMMs -RRB- provide a suitable generative model for field structured text , general unsupervised HMM learning fails to learn useful structure in either of our domains . However , one can dramatically improve the quality of the learned structure by exploiting simple prior knowledge of the desired solutions . In both domains , we found that unsupervised methods can attain accuracies with 400 unlabeled examples comparable to those attained by supervised methods on 50 labeled examples , and that semi-supervised methods can make good use of small amounts of labeled data . ##W03-1014 Learning Extraction Patterns For Subjective Expressions . This paper presents a bootstrapping process that learns linguistically rich extraction patterns for subjective -LRB- opinionated -RRB- expressions . High-precision classifiers label unannotated data to automatically create a large training set , which is then given to an extraction pattern learning algorithm . The learned patterns are then used to identify more subjective sentences . The bootstrapping process learns many subjective patterns and increases recall while maintaining high precision . ##P95-1038 Evaluation Of Semantic Clusters . Semantic clusters of a domain form an important feature that can be useful for performing syntactic and semantic disambiguation . Several attempts have been made to extract the semantic clusters of a domain by probabilistic or taxonomic techniques . However , not much progress has been made in evaluating the obtained semantic clusters . This paper focuses on an evaluation mechanism that can be used to evaluate semantic clusters produced by a system against those provided by human experts . ##P98-1017 An Efficient Kernel for Multilingual Generation in Speech-to-Speech Dialogue Translation . We present core aspects of a fully implemented generation component in a multilingual speech-to-speech dialogue translation system . 
Its design was particularly influenced by the necessity of real-time processing and usability for multiple languages and domains . We developed a general kernel system comprising a microplanning and a syntactic realizer module . The microplanner performs lexical and syntactic choice , based on constraint-satisfaction techniques . The syntactic realizer processes HPSG grammars reflecting the latest developments of the underlying linguistic theory , utilizing their pre-processing into the TAG formalism . The declarative nature of the knowledge bases , i.e. , the microplanning constraints and the HPSG grammars , allowed an easy adaptation to new domains and languages . The successful integration of our component into the translation system Verbmobil proved the fulfillment of the specific real-time constraints . ##P08-1084 Unsupervised Multilingual Learning for Morphological Segmentation . For centuries , the deep connection between languages has brought about major discoveries about human communication . In this paper we investigate how this powerful source of information can be exploited for unsupervised language learning . In particular , we study the task of morphological segmentation of multiple languages . We present a nonparametric Bayesian model that jointly induces morpheme segmentations of each language under consideration and at the same time identifies cross-lingual morpheme patterns , or abstract morphemes . We apply our model to three Semitic languages : Arabic , Hebrew , and Aramaic , as well as to English . Our results demonstrate that learning morphological models in tandem reduces error by up to 24 % relative to monolingual models . Furthermore , we provide evidence that our joint model achieves better performance when applied to languages from the same family . ##P09-1067 Variational Decoding for Statistical Machine Translation . Statistical models in machine translation exhibit spurious ambiguity .
That is , the probability of an output string is split among many distinct derivations -LRB- e.g. , trees or segmentations -RRB- . In principle , the goodness of a string is measured by the total probability of its many derivations . However , finding the best string -LRB- e.g. , during decoding -RRB- is then computationally intractable . Therefore , most systems use a simple Viterbi approximation that measures the goodness of a string using only its most probable derivation . Instead , we develop a variational approximation , which considers all the derivations but still allows tractable decoding . Our particular variational distributions are parameterized as n-gram models . We also analytically show that interpolating these n-gram models for different n is similar to minimum-risk decoding for BLEU -LRB- Tromble et al. , 2008 -RRB- . Experiments show that our approach improves the state of the art . ##W04-1607 Finite-State Morphological Analysis Of Persian . This paper describes a two-level morphological analyzer for Persian using a system based on the Xerox finite state tools . The Persian language presents certain challenges to computational analysis : There is a complex verbal conjugation paradigm which includes long-distance morphological dependencies ; phonological alternations apply at morpheme boundaries ; word and noun phrase boundaries are difficult to define since morphemes may be detached from their stems and distinct words can appear without an intervening space . In this work , we address these problems and provide solutions in a finite-state morphology system . ##C96-2159 Decision Tree Learning Algorithm With Structured Attributes : Application To Verbal Case Frame Acquisition . The Decision Tree Learning Algorithms -LRB- DTLAs -RRB- are getting keen attention from the natural language processing research community , and there have been a series of attempts to apply them to verbal case frame acquisition .
However , a DTLA can not handle structured attributes like nouns , which are classified under a thesaurus . In this paper , we present a new DTLA that can rationally handle the structured attributes . In the process of tree generation , the algorithm generalizes each attribute optimally using a given thesaurus . We apply this algorithm to a bilingual corpus and show that it successfully learned a generalized decision tree for classifying the verb `` take '' and that the tree was smaller with more prediction power on the open data than the tree learned by the conventional DTLA . ##C04-1030 Reordering Constraints For Phrase-Based Statistical Machine Translation . In statistical machine translation , the generation of a translation hypothesis is computationally expensive . If arbitrary reorderings are permitted , the search problem is NP-hard . On the other hand , if we restrict the possible reorderings in an appropriate way , we obtain a polynomial-time search algorithm . We investigate different reordering constraints for phrase-based statistical machine translation , namely the IBM constraints and the ITG constraints . We present efficient dynamic programming algorithms for both constraints . We evaluate the constraints with respect to translation quality on two Japanese -- English tasks . We show that the reordering constraints improve translation quality compared to an unconstrained search that permits arbitrary phrase reorderings . The ITG constraints perform best on both tasks and yield statistically significant improvements compared to the unconstrained search . ##J96-3003 Efficient Multilingual Phoneme-To-Grapheme Conversion Based On HMM . Grapheme-to-phoneme conversion -LRB- GTPC -RRB- has been achieved in most European languages by dictionary look-up or using rules . The application of these methods , however , in the reverse process , -LRB- i.e.
, in phoneme-to-grapheme conversion -LRB- PTGC -RRB- -RRB- creates serious problems , especially in inflectionally rich languages . In this paper the PTGC problem is approached from a completely different point of view . Instead of rules or a dictionary , the statistics of language connecting pronunciation to spelling are exploited . The novelty lies in modeling the natural language intraword features using the theory of hidden Markov models -LRB- HMM -RRB- and performing the conversion using the Viterbi algorithm . The PTGC system has been established and tested on various multilingual corpora . Initially , the first-order HMM and the common Viterbi algorithm were used to obtain a single transcription for each word . Afterwards , the second-order HMM and the N-best algorithm adapted to PTGC were implemented to provide one or more transcriptions for each word input -LRB- homophones -RRB- . This system gave an average score of more than 99 % correctly transcribed words -LRB- overall success in the first four candidates -RRB- for most of the seven languages it was tested on -LRB- Dutch , English , French , German , Greek , Italian , and Spanish -RRB- . The system can be adapted to almost any language with little effort and can be implemented in hardware to serve in real-time speech recognition systems . ##W06-2207 A Hybrid Approach For The Acquisition Of Information Extraction Patterns . In this paper we present a hybrid approach for the acquisition of syntactico-semantic patterns from raw text . Our approach co-trains a decision list learner whose feature space covers the set of all syntactico-semantic patterns with an Expectation Maximization clustering algorithm that uses the text words as attributes . We show that the combination of the two methods always outperforms the decision list learner alone . Furthermore , using a modular architecture we investigate several algorithms for pattern ranking , the most important component of the decision list learner .
##I05-1066 Automatic Slide Generation Based on Discourse Structure Analysis . In this paper , we describe a method of automatically generating summary slides from a text . The slides are generated by itemizing topic\/non-topic parts that are extracted from the text based on syntactic\/case analysis . The indentations of the items are controlled according to the discourse structure , which is detected by cue phrases , identification of word chains and similarity between two sentences . Our experiments demonstrate that the generated slides are far easier to read in comparison with the original texts .