Classification for classes that are not mutually exclusive
is called any-of, multilabel, or multivalue classification. In this
case, a document can belong to several classes
simultaneously, or to a single class, or to none of the
classes. A decision on one class leaves all options open
for the others. It is sometimes said that the classes are
independent of each other, but this is misleading
since the classes are rarely statistically independent in
the sense defined in Section 13.5.2. In terms
of the formal definition of the classification problem in
Equation 112 (page 112), we learn $J$
different classifiers $\gamma_j$
in any-of classification,
each returning either $c_j$ or $\overline{c_j}$:
$\gamma_j(d) \in \{c_j, \overline{c_j}\}$.
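Solving an any-of classification task with linear classifiers is straightforward: build one two-class classifier per class, training it on the documents in the class (positive examples) against its complement (negative examples), and then apply each classifier to the test document separately. The decision of one classifier has no influence on the decisions of the others.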
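A minimal sketch of any-of classification with one linear classifier per class is given below. The perceptron is used here only as a convenient stand-in for "a linear classifier"; the toy term-count vectors, the class names, and the function names `train_perceptron`, `train_any_of`, and `classify_any_of` are assumptions made for illustration, not part of the text.

```python
def score(w, b, x):
    """Linear score w.x + b of document vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_perceptron(docs, labels, epochs=20):
    """Train a two-class linear classifier; labels are +1 or -1."""
    w, b = [0.0] * len(docs[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(docs, labels):
            if y * score(w, b, x) <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def train_any_of(docs, doc_classes, classes):
    """One classifier per class: members are positive, the complement negative."""
    return {
        c: train_perceptron(docs, [1 if c in cs else -1 for cs in doc_classes])
        for c in classes
    }

def classify_any_of(models, x):
    """Apply every classifier independently; zero, one, or several classes may fire."""
    return {c for c, (w, b) in models.items() if score(w, b, x) > 0}

# Hypothetical toy data: counts of the terms [china, coffee, poultry] per document.
docs = [[2, 0, 0], [0, 3, 0], [0, 0, 2], [1, 2, 0], [0, 0, 0]]
doc_classes = [{"China"}, {"coffee"}, {"poultry"}, {"China", "coffee"}, set()]
models = train_any_of(docs, doc_classes, ["China", "coffee", "poultry"])

print(sorted(classify_any_of(models, [1, 2, 0])))  # may belong to several classes
print(sorted(classify_any_of(models, [0, 0, 0])))  # may belong to none
```

Because each classifier is trained and applied on its own, adding or removing a class does not require retraining the others.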
The second type of classification with more than two classes
is one-of classification. Here, the classes are
mutually exclusive: each document must belong to exactly one of
the classes.
One-of classification is also called
multinomial, polytomous, multiclass,
or single-label classification.
Formally, there is a single classification function $\gamma$
in one-of classification whose range is $\mathbb{C}$,
i.e., $\gamma(d) \in \{c_1, \ldots, c_J\}$.
kNN is a (nonlinear) one-of classifier.
True one-of problems are less common in text classification than any-of problems. With classes like UK, China, poultry, or coffee, a document can be relevant to many topics simultaneously - as when the prime minister of the UK visits China to talk about the coffee and poultry trade.
Nevertheless, we will often make a one-of assumption, as we did in Figure 14.1, even if classes are not really mutually exclusive. For the classification problem of identifying the language of a document, the one-of assumption is a good approximation as most text is written in only one language. In such cases, imposing a one-of constraint can increase the classifier's effectiveness because it eliminates the errors that any-of classifiers make when they assign a document to no class or to more than one class.
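The effect of a one-of constraint can be sketched as follows. One simple way to impose it, assumed here for illustration, is to rank the classes by classifier score and keep only the top-ranked one. The scores, the threshold of 0, and the function names `assign_any_of` and `assign_one_of` are hypothetical.

```python
def assign_any_of(scores, threshold=0.0):
    """Independent decisions: every class whose score clears the threshold."""
    return {c for c, s in scores.items() if s > threshold}

def assign_one_of(scores):
    """One-of constraint: exactly one class, the one with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical per-language scores for two documents.
two_fire = {"English": 0.4, "German": 0.1, "French": -0.7}
none_fire = {"English": -0.2, "German": -0.9, "French": -0.5}

for scores in (two_fire, none_fire):
    print(sorted(assign_any_of(scores)), "->", assign_one_of(scores))
```

In the first document two classifiers fire and in the second none does; under the one-of constraint both documents still receive exactly one language.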
Table 14.5 (rows: true class; columns: assigned class)
true class | money-fx | trade | interest | wheat | corn | grain
money-fx | 95 | 0 | 10 | 0 | 0 | 0
trade | 1 | 1 | 90 | 0 | 1 | 0
interest | 13 | 0 | 0 | 0 | 0 | 0
wheat | 0 | 0 | 1 | 34 | 3 | 7
corn | 1 | 0 | 2 | 13 | 26 | 5
grain | 0 | 0 | 2 | 14 | 5 | 10
An important tool for analyzing the performance of a
classifier for $J > 2$ classes is the confusion matrix. The
confusion matrix shows for each pair of classes
$\langle c_1, c_2 \rangle$ how many
documents from $c_1$
were incorrectly assigned to $c_2$. In
Table 14.5,
the classifier manages to distinguish the three
financial classes money-fx,
trade, and
interest from the three agricultural classes
wheat,
corn, and
grain, but makes many errors within these two
groups.
The confusion matrix can help pinpoint opportunities
for improving
the accuracy of the system. For example, to address the
second-largest error in Table 14.5 (the 14 grain documents assigned to wheat), one could attempt to
introduce features that distinguish wheat documents
from grain documents.
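As a rough sketch of how such an analysis might be automated, the code below counts (true class, assigned class) pairs and lists the largest off-diagonal entries. The counts are copied from Table 14.5; the function names `confusion_matrix` and `largest_errors` are assumptions made for illustration.

```python
from collections import Counter

def confusion_matrix(true_labels, assigned_labels):
    """Count how often each (true class, assigned class) pair occurs."""
    return Counter(zip(true_labels, assigned_labels))

def largest_errors(matrix, n=2):
    """Return the n largest off-diagonal cells as (count, true, assigned)."""
    errors = [(count, t, a) for (t, a), count in matrix.items()
              if t != a and count > 0]
    return sorted(errors, reverse=True)[:n]

# Counts from Table 14.5: rows are true classes, columns are assigned classes.
classes = ["money-fx", "trade", "interest", "wheat", "corn", "grain"]
rows = [
    [95, 0, 10, 0, 0, 0],
    [1, 1, 90, 0, 1, 0],
    [13, 0, 0, 0, 0, 0],
    [0, 0, 1, 34, 3, 7],
    [1, 0, 2, 13, 26, 5],
    [0, 0, 2, 14, 5, 10],
]
table = {(t, a): rows[i][j]
         for i, t in enumerate(classes)
         for j, a in enumerate(classes)}

for count, true_class, assigned in largest_errors(table):
    print(f"{count} documents from {true_class} assigned to {assigned}")

# The same matrix can be built directly from per-document label pairs.
small = confusion_matrix(["wheat", "wheat", "grain"], ["wheat", "grain", "wheat"])
```

Sorting the off-diagonal cells makes the dominant confusions visible without reading the whole table, which becomes more useful as the number of classes grows.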
Exercises.