The text classification problem

In text classification, we are given a description $\onedoc \in \mathbb{X}$ of a document, where $\mathbb{X}$ is the document space ; and a fixed set of classes $\mathbb{C} = \{ c_1,c_2,\ldots,c_J \}$ . Classes are also called categories or labels . Typically, the document space $\mathbb{X}$ is some type of high-dimensional space, and the classes are human defined for the needs of an application, as in the examples China and documents that talk about multicore computer chips above. We are given a training set $\docsetlabeled$ of labeled documents $\onedoclabeled$ , where $\onedoclabeled \in \mathbb{X} \times \mathbb{C}$ . For example:

$\begin{displaymath} \onedoclabeled=\langle\mbox{Beijing joins the World Trade Organization},\class{China}\rangle \end{displaymath}$

(111)

Using a learning method or learning algorithm , we then wish to learn a classifier or classification function $\gamma$ that maps documents to classes:

$\begin{displaymath}\gamma: \mathbb{X} \rightarrow \mathbb{C} \end{displaymath}$

(112)

This type of learning is called supervised learning because a supervisor (the human who defines the classes and labels training documents) serves as a teacher directing the learning process. We denote the supervised learning method by $\Gamma$ and write $\Gamma(\docsetlabeled) = \gamma$ . The learning method $\Gamma$ takes the training set $\docsetlabeled$ as input and returns the learned classification function $\gamma$ .

Most names for learning methods $\Gamma$ are also used for classifiers $\gamma$ . We talk about the Naive Bayes (NB) learning method $\Gamma$ when we say that ``Naive Bayes is robust,'' meaning that it can be applied to many different learning problems and is unlikely to produce classifiers that fail catastrophically. But when we say that ``Naive Bayes had an error rate of 20%,'' we are describing an experiment in which a particular NB classifier $\gamma$ (which was produced by the NB learning method) had a 20% error rate in an application.

Figure 13.1 shows an example of text classification from the Reuters-RCV1 collection, introduced in Section 4.2 , page 4.2 . There are six classes (UK, China, ..., sports), each with three training documents. We show a few mnemonic words for each document's content. The training set provides some typical examples for each class, so that we can learn the classification function $\gamma$ . Once we have learned $\gamma$ , we can apply it to the test set (or test data ), for example, the new document first private Chinese airline whose class is unknown. In Figure 13.1 , the classification function assigns the new document to class $\gamma(\onedoc) =$ China, which is the correct assignment.

The classes in text classification often have some interesting structure such as the hierarchy in Figure 13.1 . There are two instances each of region categories, industry categories, and subject area categories. A hierarchy can be an important aid in solving a classification problem; see Section 15.3.2 for further discussion. Until then, we will make the assumption in the text classification chapters that the classes form a set with no subset relationships between them.

**Figure 13.1:** Classes, training set, and test set in text classification .
$\begin{figure}\par \psset{unit=0.8cm} \par \begin{pspicture}(-3,-6)(13,3) \par \... ...8)(3.05,2.3)(1.75,1.15) \rput[c](10.9,0.0) \par \end{pspicture}\par \end{figure}$

Definition eqn:gammadef stipulates that a document is a member of exactly one class. This is not the most appropriate model for the hierarchy in Figure 13.1 . For instance, a document about the 2008 Olympics should be a member of two classes: the China class and the sports class. This type of classification problem is referred to as an any-of problem and we will return to it in Section 14.5 (page

). For the time being, we only consider one-of problems where a document is a member of exactly one class.

Our goal in text classification is high accuracy on test data or new data - for example, the newswire articles that we will encounter tomorrow morning in the multicore chip example. It is easy to achieve high accuracy on the training set (e.g., we can simply memorize the labels). But high accuracy on the training set in general does not mean that the classifier will work well on new data in an application. When we use the training set to learn a classifier for test data, we make the assumption that training data and test data are similar or from the same distribution. We defer a precise definition of this notion to Section 14.6 (page