Text analysis as a noisy source/channel decoding problem

Paul Taylor
University of Cambridge

Abstract

This talk addresses the question of what exactly "text" is, and how best we should model its behaviour. The starting point is a departure from two widely held views of what the nature of text or writing actually is. In the first view, "text" and "words" are conflated, in that a sentence of text is seen as comprising a sequence of words. The second view sees text as the written representation of natural language, where writing is thought of as a process of visually recoding the kinds of things that we say when we speak.

This work proposes a different approach. Firstly, we adopt a noisy source/channel model in which words are the input to an encoding process whose output is noisy text. The primary task in text processing is then to decode the text and uncover the words in a statistical manner. The advantage of this technique is that a single mechanism deals with many of the common text segmentation and disambiguation tasks, including tokenisation, sentence splitting, POS tagging, spelling checking and homograph/homonym disambiguation. Secondly, we challenge the assumption that text encodes natural language. Rather, we see text as a written encoding of a number of different semiotic systems, of which language is just one; hence when we encounter numbers, dates, mathematics and email addresses, we should adopt a similar but parallel approach to that used for "real" natural language.
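
As a rough illustration of the decoding step (a toy sketch, not taken from the talk; the probabilities and the "Dr." example below are invented for exposition), the decoder chooses the word sequence w that maximises P(w) P(t | w), where P(w) is a source model over words and P(t | w) is a channel model of how words surface as text:

    import math

    # Toy source model: prior probabilities of candidate word readings.
    # (Invented values, for illustration only.)
    SOURCE = {
        ("doctor",): 0.6,
        ("drive",): 0.4,
    }

    # Toy channel model: probability that a word sequence is encoded as
    # a given text token. (Also invented.)
    CHANNEL = {
        (("doctor",), "Dr."): 0.7,
        (("drive",), "Dr."): 0.3,
    }

    def decode(text_token):
        """Return the word sequence w maximising log P(w) + log P(t | w)."""
        best, best_score = None, -math.inf
        for words, prior in SOURCE.items():
            likelihood = CHANNEL.get((words, text_token), 0.0)
            if likelihood == 0.0:
                continue  # this reading cannot produce the observed text
            score = math.log(prior) + math.log(likelihood)
            if score > best_score:
                best, best_score = words, score
        return best

    print(decode("Dr."))  # -> ('doctor',)

Here the ambiguous token "Dr." decodes to "doctor" because the product of prior and channel probability is higher for that reading; the same mechanism extends to tokenisation, tagging and the other tasks above by searching over word sequences rather than single tokens.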