In Section 8.7, we mentioned the field of text summarization,
and how most work in that field has adopted the limited goal of
extracting and assembling pieces of the original text that are judged
to be central based on
features of sentences that consider the sentence's position and
content. Much of this work can be used to suggest zones that may be
distinctively useful for text classification. For example,
Kołcz et al. (2000) consider a form of feature selection
where you classify documents based only on words in certain zones.
Based on text summarization research, they consider using
(i) only the title, (ii) only the first paragraph, (iii) only the
paragraph with the most title words or keywords, (iv) the first two
paragraphs or the first and last paragraph, or (v) all sentences with
a minimum number of title words or keywords. In general, these
positional feature selection methods produced results as good as feature
selection by mutual information (Section 13.5.1), and resulted in quite competitive
classifiers. Ko et al. (2004) also took inspiration from text
summarization research to upweight sentences with either words from
the title or words that are central to the document's content, leading
to classification accuracy gains of almost 1%.
This presumably works because most such sentences are somehow
more central to the concerns of the document.
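
The zone strategies and the sentence upweighting described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation used in either paper: the function names (select_zone, weighted_term_counts), the regular-expression tokenizer and sentence splitter, the boost value, and the use of title words alone (with no separate keyword list) are all hypothetical choices made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_sentences(paragraph):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def title_overlap(text, title_words):
    """Number of tokens in text that also occur in the title."""
    return sum(1 for t in tokenize(text) if t in title_words)


def select_zone(title, paragraphs, strategy, min_title_words=1):
    """Return the text of the zone chosen by one positional strategy.

    Strategy names are hypothetical labels for options (i)-(v) in the text.
    """
    title_words = set(tokenize(title))
    if strategy == "title":                     # (i)
        return title
    if strategy == "first_paragraph":           # (ii)
        return paragraphs[0] if paragraphs else ""
    if strategy == "most_title_words":          # (iii), ties go to the earliest
        return max(paragraphs,
                   key=lambda p: title_overlap(p, title_words),
                   default="")
    if strategy == "first_two":                 # (iv), first variant
        return " ".join(paragraphs[:2])
    if strategy == "first_and_last":            # (iv), second variant
        if len(paragraphs) <= 1:
            return " ".join(paragraphs)
        return paragraphs[0] + " " + paragraphs[-1]
    if strategy == "title_word_sentences":      # (v)
        kept = [s for p in paragraphs for s in split_sentences(p)
                if len(title_words & set(tokenize(s))) >= min_title_words]
        return " ".join(kept)
    raise ValueError("unknown strategy: " + strategy)


def weighted_term_counts(title, paragraphs, boost=2.0):
    """Term counts in which sentences sharing a word with the title count extra.

    Assumption: boost=2.0 is arbitrary, and only title words (not words
    central to the document's content) trigger the upweighting.
    """
    title_words = set(tokenize(title))
    counts = Counter()
    for p in paragraphs:
        for s in split_sentences(p):
            tokens = tokenize(s)
            weight = boost if title_words & set(tokens) else 1.0
            for t in tokens:
                counts[t] += weight
    return counts
```

The text returned by select_zone would then be fed to whatever classifier is in use in place of the full document, whereas weighted_term_counts keeps every sentence and only changes how much each one contributes to the term counts.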
Exercises.
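Spam email often makes use of various cloaking techniques to try to get through. One method is to pad or substitute characters so as to defeat word-based text classifiers. For example, you see terms like the following in spam email:
Rep1icaRolex bonmus Viiiaaaagra pi11z PHARlbdMACY [LEV]i[IT]l[RA] se xual ClAfLlS
Discuss how you could engineer features that would largely defeat this strategy.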