Spam

Next: Advertising as the economic Up: Web characteristics Previous: The web graph Contents Index

Spam

Early in the history of web search, it became clear that web search engines were an important means for connecting advertisers to prospective buyers. A user searching for maui golf real estate is not merely seeking news or entertainment on the subject of housing on golf courses on the island of Maui, but instead likely to be seeking to purchase such a property. Sellers of such property and their agents, therefore, have a strong incentive to create web pages that rank highly on this query. In a search engine whose scoring was based on term frequencies, a web page with numerous repetitions of maui golf real estate would rank highly. This led to the first generation of spam , which (in the context of web search) is the manipulation of web page content for the purpose of appearing high up in search results for selected keywords. To avoid irritating users with these repetitions, sophisticated spammers resorted to such tricks as rendering these repeated terms in the same color as the background. Despite these words being consequently invisible to the human user, a search engine indexer would parse the invisible words out of the HTML representation of the web page and index these words as being present in the page.

At its root, spam stems from the heterogeneity of motives in content creation on the Web. In particular, many web content creators have commercial motives and therefore stand to gain from manipulating search engine results. You might argue that this is no different from a company that uses large fonts to list its phone numbers in the yellow pages; but this generally costs the company more and is thus a fairer mechanism. A more apt analogy, perhaps, is the use of company names beginning with a long string of A's to be listed early in a yellow pages category. In fact, the yellow pages' model of companies paying for larger/darker fonts has been replicated in web search: in many search engines, it is possible to pay to have one's web page included in the search engine's index - a model known as paid inclusion . Different search engines have different policies on whether to allow paid inclusion, and whether such a payment has any effect on ranking in search results.

Search engines soon became sophisticated enough in their spam detection to screen out a large number of repetitions of particular keywords. Spammers responded with a richer set of spam techniques, the best known of which we now describe. The first of these techniques is cloaking, shown in Figure 19.5 . Here, the spammer's web server returns different pages depending on whether the http request comes from a web search engine's crawler (the part of the search engine that gathers web pages, to be described in Chapter 20 ), or from a human user's browser. The former causes the web page to be indexed by the search engine under misleading keywords. When the user searches for these keywords and elects to view the page, he receives a web page that has altogether different content than that indexed by the search engine. Such deception of search indexers is unknown in the traditional world of information retrieval; it stems from the fact that the relationship between page publishers and web search engines is not completely collaborative.

**Figure 19.5:** Cloaking as used by spammers.
$\includegraphics[width=10cm]{cloaking.eps}$

A doorway page contains text and metadata carefully chosen to rank highly on selected search keywords. When a browser requests the doorway page, it is redirected to a page containing content of a more commercial nature. More complex spamming techniques involve manipulation of the metadata related to a page including (for reasons we will see in Chapter 21 ) the links into a web page. Given that spamming is inherently an economically motivated activity, there has sprung around it an industry of Search Engine Optimizers , or SEOs to provide consultancy services for clients who seek to have their web pages rank highly on selected keywords. Web search engines frown on this business of attempting to decipher and adapt to their proprietary ranking techniques and indeed announce policies on forms of SEO behavior they do not tolerate (and have been known to shut down search requests from certain SEOs for violation of these). Inevitably, the parrying between such SEOs (who gradually infer features of each web search engine's ranking methods) and the web search engines (who adapt in response) is an unending struggle; indeed, the research sub-area of adversarial information retrieval has sprung up around this battle. To combat spammers who manipulate the text of their web pages is the exploitation of the link structure of the Web - a technique known as link analysis. The first web search engine known to apply link analysis on a large scale (to be detailed in Chapter 21 ) was Google, although all web search engines currently make use of it (and correspondingly, spammers now invest considerable effort in subverting it - this is known as link spam ).

Exercises.

If the number of pages with in-degree is proportional to $1/i^{2.1}$ , what is the probability that a randomly chosen web page has in-degree ?
If the number of pages with in-degree is proportional to $1/i^{2.1}$ , what is the average in-degree of a web page?
If the number of pages with in-degree is proportional to $1/i^{2.1}$ , then as the largest in-degree goes to infinity, does the fraction of pages with in-degree grow, stay the same, or diminish? How would your answer change for values of the exponent other than ?
The average in-degree of all nodes in a snapshot of the web graph is 9. What can we say about the average out-degree of all nodes in this snapshot?

Next: Advertising as the economic Up: Web characteristics Previous: The web graph Contents Index

© 2008 Cambridge University Press
This is an automatically generated page. In case of formatting errors you may want to look at the PDF edition of the book.
2009-04-07