Hypertext on the Web and the user activity around it are rich sources of data for analysis. Specific features include page text, the hyperlink graph, anchor text, topic directories, surfing history, and bookmark structure. In this talk we will discuss two important steps toward exploiting these information sources. First, we will present a system called Memex, which forms community-specific topic taxonomies from bookmarking and surfing behavior. Second, we will describe a system called the Focused Crawler, which learns from such a taxonomy to perform automatic resource discovery on the Web.
These systems depend on the implicit locality of content on the Web and on statistical relations between page content and link structure. Some of the techniques we will describe may also have implications for analyzing linguistic knowledge represented as graphs, such as WordNet.
We believe that such a community-based divide-and-conquer approach to content is essential for improving coverage and query handling on the Web, beyond what the current generation of keyword-matching search engines can offer.