At its root, spam stems from the heterogeneity of motives in content creation on the Web. In particular, many web content creators have commercial motives and therefore stand to gain from manipulating search engine results. You might argue that this is no different from a company that uses large fonts to list its phone numbers in the yellow pages; but this generally costs the company more and is thus a fairer mechanism. A more apt analogy, perhaps, is the use of company names beginning with a long string of A's to be listed early in a yellow pages category. In fact, the yellow pages' model of companies paying for larger/darker fonts has been replicated in web search: in many search engines, it is possible to pay to have one's web page included in the search engine's index - a model known as paid inclusion . Different search engines have different policies on whether to allow paid inclusion, and whether such a payment has any effect on ranking in search results.
Search engines soon became sophisticated enough in their spam detection to screen out a large number of repetitions of particular keywords. Spammers responded with a richer set of spam techniques, the best known of which we now describe. The first of these techniques is cloaking, shown in Figure 19.5 . Here, the spammer's web server returns different pages depending on whether the http request comes from a web search engine's crawler (the part of the search engine that gathers web pages, to be described in Chapter 20 ), or from a human user's browser. The former causes the web page to be indexed by the search engine under misleading keywords. When the user searches for these keywords and elects to view the page, he receives a web page that has altogether different content than that indexed by the search engine. Such deception of search indexers is unknown in the traditional world of information retrieval; it stems from the fact that the relationship between page publishers and web search engines is not completely collaborative.
A doorway page contains text and metadata carefully chosen to rank highly on selected search keywords. When a browser requests the doorway page, it is redirected to a page containing content of a more commercial nature. More complex spamming techniques involve manipulation of the metadata related to a page including (for reasons we will see in Chapter 21 ) the links into a web page. Given that spamming is inherently an economically motivated activity, there has sprung around it an industry of Search Engine Optimizers , or SEOs to provide consultancy services for clients who seek to have their web pages rank highly on selected keywords. Web search engines frown on this business of attempting to decipher and adapt to their proprietary ranking techniques and indeed announce policies on forms of SEO behavior they do not tolerate (and have been known to shut down search requests from certain SEOs for violation of these). Inevitably, the parrying between such SEOs (who gradually infer features of each web search engine's ranking methods) and the web search engines (who adapt in response) is an unending struggle; indeed, the research sub-area of adversarial information retrieval has sprung up around this battle. To combat spammers who manipulate the text of their web pages is the exploitation of the link structure of the Web - a technique known as link analysis. The first web search engine known to apply link analysis on a large scale (to be detailed in Chapter 21 ) was Google, although all web search engines currently make use of it (and correspondingly, spammers now invest considerable effort in subverting it - this is known as link spam ).
Exercises.