Turnbull on WWW collaborative filtering

Augmenting Information Seeking on the World Wide Web Using Collaborative Filtering Techniques

Don Turnbull

1. Introduction

The Internet has opened a channel of access to an interwoven labyrinth of information over an almost ubiquitous platform - the World Wide Web (the Web). Graphical Web browsers have enabled all types of users to access and share information with one another. However, once the initial thrill of Web access is over, most users don't surf the Web; they use it as an information source.

This paper takes Information Seeking research as a framework for understanding the World Wide Web environment. It also identifies opportunities for augmenting Information Seeking by applying Bibliometric analysis, filtering techniques, and collaborative technologies to Web usage data that can, in turn, leverage a Web user's Information Seeking behavior.

1.1 Overview

This paper reviews several areas of study in order to form an extensive view of the issues involved in understanding and improving how a World Wide Web browser user (Web user) can discover new information on the World Wide Web.

There are seven main sections to this paper:

Section 1: Introduction
This Introduction and Overview, intended to explain and lay out the overall topics of this paper.
Section 2: Applying Information Seeking to Electronic Environments
This section reviews the important models and studies in Information Seeking and Bibliometrics to understand and analyze Information Systems use.
Section 3: The Internet and the World Wide Web
This section introduces the Internet and World Wide Web, some of their basic standards and functionalities. Also included are descriptions and reviews of the data sources and measurement methods currently available to understand Web usage activity.
Section 4: Collaborative Filtering
This section provides a general introduction to Collaborative Filtering and presents recent significant studies and systems for Information Filtering on both the Internet and the World Wide Web. It also covers studies that illustrate general Collaborative Filtering techniques and reviews current Collaborative Filtering systems.
Section 5: Conclusion
This section concludes the research overview and summarizes the general ideas in the paper.
Section 6: Suggested Research Projects
This section proposes three research projects, each designed to answer questions about improving Information Seeking on the World Wide Web.
Section 7: Bibliography and Appendix A
A list of the works cited in this paper and explanatory information presented in the appendices.

2. Applying Information Seeking to Electronic Environments

This section reviews the important models and studies in Information Seeking and Bibliometrics (which can be seen as another way to model Information Seeking patterns) to understand and analyze Information Systems use.

2.1 Information Seeking Overview

This section focuses on Information Seeking in electronic environments, namely the World Wide Web. My goal is to explore an Information Seeking model that shows elements that can be augmented with Collaborative Filtering techniques developed through data collection and analysis. The Web environment, with its masses of unstructured and inconsistently coordinated information, is more suited to being interpreted by people than by machines. Collaborative Filtering is a quantitative way to develop qualitative data about information on the Web, thus maximizing both people and computer resources.

Because Web information is personally subjective and seemingly endless, it is more useful to focus on perceptual and cognitive recognition via browsing the Web than on determining the precision of Web searches via Information Retrieval techniques. However, this is not simple: Information Seeking on the Web is difficult to measure because a user can never know when he is finished. There is no definite ending point.

Information Seeking seems a natural problem to augment with Information Retrieval ideas, but it should additionally be leveraged with other users' Information Seeking behavior. At worst, Information Seeking and Information Retrieval can be scaffolded over each other to gradually refine a user's information need.

Marchionini gives us an appropriate definition of Information Seeking: "a process in which humans purposefully engage in order to change their state of knowledge" (Marchionini 1995).

2.1.1 Information Seeking and Information Retrieval

Many studies point out the close relation between Information Seeking and Information Retrieval. Most notably, Saracevic et al.'s comprehensive analysis of Information Seeking and Retrieval provides excellent starting points for ideas about observation and collection of data that help establish a sense for context and classification of user questions; cognitive characteristics and decision making of users; and comparisons of different searches for the same question. The measures and methods of user effectiveness and searching provide a rich framework for further studies (Saracevic and Kantor 1988a; Saracevic and Kantor 1988b; Saracevic et al. 1988).

These general differences distinguish Information Retrieval research from Information Seeking research:

Information Retrieval:

  • historically, concentrated on the system
  • focuses on planning the use of information sources and systems
  • implies that the information must have been already known
  • relies on the concrete definition of query terms
  • involves subsequent query reformulations
  • centers on the examination of results and their accuracy.

Information Seeking:

  • historically, concentrated on the user
  • focuses on understanding the heuristic and dynamic nature of browsing through information resources
  • implies that the information is sought to increase knowledge
  • follows a more opportunistic, unplanned search strategy
  • involves recognizing relevant information
  • centers on an interactive approach to make browsing easy.

From a behavioral perspective, the primary difference between Information Retrieval and Information Seeking is searching vs. browsing. Each domain is defined by the actions it studies. As computer technology matures, Information Retrieval and Information Seeking studies are moving closer. In 1996 Saracevic states that "interaction became THE most important feature of information retrieval" as the access to Information Retrieval systems has become more dynamic (Saracevic 1996). Essentially, the interactivity provides the ability to support more browsing-like approaches for finding information.

Therefore, to design a system for augmenting Information Seeking, a more robust understanding of the user and his interactions is in order. Measuring successful Information Seeking requires analyzing these subtler interactions rather than retrieval accuracy alone. Again, this makes augmenting Information Seeking via collaboration more likely to succeed. Instead of relying wholly on Information Retrieval metrics, recording and comparing a user's interactions with a system can be used to enhance the information seeker's success.

New technologies, such as the easy-to-use World Wide Web browser, will promote more Information Seeking use (and attract new users). New interfaces alone will not help us find everything we seek, although we might believe so, as we often think electronic information is more accurate or complete (Liebscher and Marchionini 1988). In a way, utilizing more collaboration between users can make up for some of the shortcomings of technical systems. Blending the different perspectives and experience levels of a pool of users can result in a larger body of resources discovered. Fidel points out two styles of expert searchers: the operationalists, who understand the system and use high-precision searches, and the conceptualists, who focus on concepts and terminology and then combine results to form more complete searches (Fidel 1984). A combination of such users cooperating can form a powerful team to enhance each other's Information Seeking.

2.1.2 Information Seeking Models

The influence of new technology on Information Seeking is also providing a new set of alternative models that more accurately describe the Information Seeking process as a dynamic activity. Models of Information Seeking attempt to describe the process a user follows to satisfy an information need. The Information Seeking models in this section focus on the behavior of Information Seeking activities.

Ellis' Model of Information Seeking

The primary model used in this research will be based on Ellis' work - initially, his model with six categories (Ellis 1989). Since Ellis has stated that these activities are applicable to hypertext environments (of which the World Wide Web is one), I will use examples from Web browsing to illustrate each category:

Starting is identifying the initial materials to search through and selecting starting points for the search. Starting, as its name implies, is usually undertaken at the beginning of the Information Seeking process to learn about a new field. Starting could also include locating key people in the field or obtaining a literature review of the field. It is also common to rely on personal contacts for informal starting information. For example, in the Web environment, the activity of starting could involve going to the Yahoo! site to find the general category listing of links related to the field of inquiry and looking for overviews, FAQs (Frequently Asked Question files - a commonly-used informal document describing a particular subject), or reputable reference sites. Another possibility is going to a bookmarked page that has proved to be useful in previously looking for similar information or consulting a colleague's own Web page or one he might have recommended.

Chaining is following leads from the starting source to referential connections to other sources that contribute new sources of information. Common chaining techniques are following references from a particular article obtained by recommendation or a literature search to references in other articles referred to in the first article. It's also quite natural to pursue the works of a particular author when following these chains. There are two kinds of chaining:

  1. backward chaining is following a pointer or reference from the initial source. For example, going to an article mentioned in the initial source's bibliography.
  2. forward chaining is looking for new sources that refer to the initial source. For example, using a citation index to find other sources that reference the initial source.

The only real constraints to chaining are time available and confidence in pursuing a line of research further. For example, using a Web browser, backward chaining would be following links on the starting page (be it an online document or a collection of links which we can assume are related in some way) to other sites. Forward chaining could involve using a search engine to look for other Web pages that link to the initial Web page.[1]
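The two kinds of chaining can be sketched over a small link graph. This is only a minimal illustration, assuming a hypothetical in-memory map from each page to the pages it links to; the page names and the functions backward_chain and forward_chain are invented for this example and are not part of Ellis' model.

```python
# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "start.html": ["survey.html", "faq.html"],
    "survey.html": ["paper-a.html"],
    "faq.html": ["paper-a.html", "start.html"],
    "paper-a.html": [],
    "review.html": ["start.html", "paper-a.html"],
}


def backward_chain(page, links):
    """Follow references outward from a page (its outgoing links)."""
    return sorted(links.get(page, []))


def forward_chain(page, links):
    """Find pages that refer to the given page (its incoming links)."""
    return sorted(src for src, targets in links.items() if page in targets)


backward = backward_chain("start.html", LINKS)
forward = forward_chain("start.html", LINKS)
```

Backward chaining only needs the starting page itself, while forward chaining has to scan every known page, which is why on the Web it usually depends on a search engine or some other index of links.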

Browsing is casually looking for information in areas of interest. This activity is made easy by the nature of documents to have tables of contents, lists of titles, topic headings, and names of persons or organizations. Browsing is being open to serendipitous findings; finding new connections or paths to information; and learning, which can cause information needs to change. While on the Web, browsing is particularly unconstrained as the most common way to follow a link is simply clicking the mouse. With link availability and adequate access speed, pursuing a new connection is quite simple. Only the worry of getting lost in an ocean of links might constrain browsing through the Web. A common example of browsing on the Web would be finding an online journal article and following its link back to the overall journal table of contents to an entirely different article. This might in turn lead to a page linking to all of the journal's various contributing authors, its editorial board, or its supporting organization's home pages.

Differentiating is selecting among the known sources by noting the distinctions of characteristics and value of the information. This activity could be ranking and organizing sources by topic, perspective, or level of detail. Differentiating is heavily dependent on the individual's previous or initial experiences with the source or by recommendations from colleagues or reviews. A Web-oriented example would be organizing bookmarks into topic categories and then prioritizing them by the depth of information they present.

Monitoring is keeping up-to-date on a topic by regularly following specific sources. Using a small set of core sources including key personal contacts and publications, developments can be tracked for a particular topic. A Web browser monitoring activity could be returning to a bookmarked source to see if the page has been updated or regularly visiting a journal's Web site when it is scheduled to publish its new Web edition.
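As a rough sketch of how monitoring might be automated, the following compares a digest of each bookmarked page against the digest recorded on the previous visit. The URLs, the page contents, and the function names fingerprint and changed_sources are assumptions made for this illustration; no real fetching is performed, and the paper does not prescribe this method.

```python
import hashlib


def fingerprint(content: str) -> str:
    """Return a stable digest of a page's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def changed_sources(previous: dict, current_pages: dict) -> list:
    """List the URLs whose content digest differs from the stored one.

    previous maps URL -> digest recorded on the last visit;
    current_pages maps URL -> page content seen on this visit.
    """
    return sorted(
        url
        for url, content in current_pages.items()
        if previous.get(url) != fingerprint(content)
    )


# Hypothetical bookmarked sources and their content on two visits.
last_visit = {
    "http://journal.example/toc": fingerprint("Issue 1"),
    "http://colleague.example/links": fingerprint("Link list v1"),
}
this_visit = {
    "http://journal.example/toc": "Issue 2",
    "http://colleague.example/links": "Link list v1",
}
updated = changed_sources(last_visit, this_visit)
```

A source that was never visited before has no stored digest, so this sketch reports it as changed as well.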

Extracting is methodically analyzing sources to identify materials of interest. This systematic re-evaluation of sources is used to build a historical survey or comprehensive reference on a topic. With a Web browser, extracting might be saving the Web page as a file or printing the Web page for use in an archive or for a segment of an overview document.

In follow-up studies, Ellis adds two more features to his model: verifying, where the accuracy of the information is checked, and ending, which typifies the conclusion of the Information Seeking process, such as building final summaries and organizing notes (Ellis 1991). These changes reflect further studies, and I also believe that as Information Seeking has become more mechanical, its processes are easier to note. However, despite refining the processes and the relationships between features of his model, Ellis also agrees that the boundaries between the features are very soft (Ellis 1996). In using the Web, verifying might involve extracting keywords from a source and searching for corroborating information on another Web page. Admittedly, the Web's newness and large percentage of un-branded information make verification of information difficult. I suspect that often information is verified by checking traditional sources, not other Web pages.

Currently (Ellis 1997), Ellis has modified his model's features somewhat, refining starting into surveying. Surveying further stresses the activity of obtaining an overview of the research terrain or locating key people operating in the field. Differentiating has been refined to distinguishing, where information sources are ranked. Distinguishing also includes noting the channel where information comes from. Ellis points out that informal channels, such as discussions or conversations, and secondary sources, such as tables of contents or abstracts, are normally ranked higher than full-text articles. This is most likely due to the increased use of electronic resources and their capacity to overload a user. For the user, some kind of hierarchy of results must be formed to place order on the Information Seeking process. With the Web, it is either easy to discover the channel of information (a Web site owned by an organization) or quite difficult to confirm (a resource included on a personal Web page) due to the ease of moving and presenting information on the World Wide Web.

Another new feature has also been added to the model: filtering, which capitalizes on personal criteria or mechanisms to increase information precision and relevancy. Typical examples of filtering are restricting a search by time or keyword. This idea of filtering, in more than name, points out that Information Filtering is a crucial element of study in Information Seeking. In a Web browser session, filtering would likely involve restricting a search for information (using a Web search engine or even on a particular Web site) by the date published or carefully noting the URL[2] of the Web page. When combined with distinguishing, where resources are actually ranked and sorted, we also begin to see how Information Retrieval is alluded to in Ellis' model of Information Seeking. Figure 1 illustrates Ellis' current Information Seeking model. Note that the overall structure of the process could be contained inside each activity, implying the fractal-like nature of the processes.

Figure 1. Ellis' Information Seeking Model

Applying Ellis' Model

I propose that the Information Seeking process is fractal-like in nature. Each feature follows the overall feature set within itself. For example, within surveying, there surely must be chaining, browsing, and differentiating, not to mention ending, which formalizes the completion of the step. Like a fractal, even the smallest change in a sub-feature (as I shall now call them) can impact not only its parent feature, but also the entire Information Seeking process. This is more than just refinement of a search: the very features of Information Seeking can take on a different mapping as the seeker, the sources, and technology change. These variations make all three of these domains places where collaboration can substantially improve Information Seeking.

For example, collaboration among seekers is the most obvious area of improvement of the process and the focus of this paper. Different users can share previous findings or cooperate to minimize future work. Sources can be more easily linked and shared as more become available digitally. Improved technology can enable more automation of monitoring; combining and comparing results; and distribution of user profiles or programs that can provide starting points for Information Seeking.

Ironically, as resources become more plentiful due to technology, they are also being more loosely, if at all, classified. The resource demands of publishing information are far lower than those of direct expert classification, so publishing often excludes indexing. Without common organization among electronic resources, more individual work will be needed to build maps of a research terrain. Again, Collaborative Filtering can help in an ad hoc way by at least establishing operational classifications of information by communities of users who pool their resources. Their resources cannot help but become classified in some form: by user, by implicit or explicitly agreed-upon language, or by usage ranking as resources fall prey to limited attention.
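The idea of classification by usage ranking can be illustrated with a small sketch. It assumes a hypothetical pool of bookmark lists contributed by three users and ranks each resource by how many distinct users saved it; the user names, URLs, and the function rank_by_usage are invented for this example, and counting users is only one of many possible usage measures.

```python
from collections import Counter

# Hypothetical bookmark lists contributed by three users.
BOOKMARKS = {
    "ana": ["http://a.example", "http://b.example"],
    "ben": ["http://b.example", "http://c.example"],
    "cho": ["http://b.example", "http://a.example"],
}


def rank_by_usage(bookmarks):
    """Rank pooled resources by how many distinct users saved them."""
    counts = Counter()
    for urls in bookmarks.values():
        counts.update(set(urls))  # count each user at most once per resource
    # Sort by descending count, then by URL for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


ranking = rank_by_usage(BOOKMARKS)
```

Grouping the same pool by user name instead of by count would give the "by user" classification mentioned above.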

2.1.3 Kuhlthau's Model of the Information Search Process

Kuhlthau provides an additional model which focuses on the information search process from the user's perspective. Her six stages in the Information Search Process (ISP) Model are:

  1. initiation - beginning the process, characterized by feelings of uncertainty and more general ideas with a need to recognize or connect new ideas to existing knowledge.
  2. selection - choosing the initial general topic with general feelings of optimism by using selection to identify the most useful areas of inquiry.
  3. exploration - investigating to extend personal understanding and reduce the feelings of uncertainty and confusion about the topic and the process.
  4. formulation - focusing the process with the information encountered accompanied by feelings of increased confidence.
  5. collection - interacting smoothly with the information system with feelings of confidence as the topic is defined and extended by selecting and reviewing information.[3]
  6. presentation - completing the process with a feeling of confidence or failure depending how useful the findings are.(Kuhlthau 1991)

2.1.4 Belkin's Information Seeking Process Model

Belkin provides another view of the Information Seeking process, described as Information Seeking Strategies (ISS). This view can be perceived as a more task-oriented overlay of either Kuhlthau's or Ellis' model. The set of tasks is:

  • browsing - scanning or searching a resource.
  • learning - expanding knowledge of the goal, problem, system or available resources through selection.
  • recognition - identifying relevant items (via system or cognitive association).
  • metainformation - interacting with the items that map the boundaries of the task (Belkin, Marchetti, and Cool 1993).

Again, this model is neither linear nor like a typical waterfall process flow. Belkin even stresses this non-linearity in that he suggests that the model should support "graceful movements" among the tasks.

2.1.5 Belkin's Anomalous States of Knowledge

Belkin also provides some useful perspectives with the Anomalous State of Knowledge (ASK) theory, "the cognitive and situational aspects that were the reason for seeking information and approaching an IR system" (Saracevic 1996). Belkin proposes that a search begins with a problem and a need to solve it - the gap between these is defined as the information need. The user gradually builds a bridge of levels of information that may change the question or the desired solution as the process continues (Belkin, Oddy, and Brooks 1982).

In other words, this view treats information seeking as a dynamic process in which expertise grows both in knowledge about the solution and in using the capabilities of the particular information system itself. Taking these ideas, Belkin advocates a systems design using a network of associations between items as a means of filling the knowledge gap. By establishing relationships between individual pieces of knowledge, a bridge of supporting information can be used to cross the knowledge gap. Using a collection of associations in this manner provides a framework that can be applied to designing Collaborative Filtering mechanisms, which work from building associations between users.
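One way to picture a network of associations bridging a knowledge gap is as a path search over a graph. The sketch below is an interpretation for illustration only, not Belkin's design: the item names are hypothetical, the associations are treated as undirected, and the function bridge uses a plain breadth-first search to find the shortest chain of associated items.

```python
from collections import deque

# Hypothetical associations between items of knowledge (undirected).
ASSOCIATIONS = {
    "problem": ["concept-a", "concept-b"],
    "concept-a": ["problem", "concept-c"],
    "concept-b": ["problem"],
    "concept-c": ["concept-a", "solution"],
    "solution": ["concept-c"],
}


def bridge(start, goal, associations):
    """Return the shortest chain of associated items from start to goal,
    or None if the network offers no connection."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in associations.get(path[-1], []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = bridge("problem", "solution", ASSOCIATIONS)
```

In a Collaborative Filtering setting the nodes could equally be users rather than items, with an association recorded whenever two users share a resource.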

2.2 Expanding Information Seeking

There are a few other key ideas from Information Seeking that I will take advantage of in analyzing Collaborative Filtering systems. Both the study of gatekeepers and expanding Information Seeking to the domain of Environmental Scanning prove to be the most relevant.

Gatekeepers

Gatekeepers (Allen 1977) are individuals who read more journals, have more external contacts, generate more ideas and engage in more problem solving than typical individuals. Examination of these types of users can help build models of information seeking as their activities are more intense than typical information seekers. In addition, gatekeepers are known to provide information to others and also give practical and political advice.

Environmental Scanning

Environmental Scanning studies also expand the area of Information Seeking studies. Choo and Auster's historical review of studies brings the user's entire environment into focus (Choo and Auster 1993). Not only is the information system studied, but it is compared and contrasted to other information resources. Aguilar (Aguilar 1967) provides a broad map of the environment by defining four major modes of scanning:

  1. undirected viewing - "general exposure to information where the viewer has no specific purpose in mind with the possible exception of exploration" involving a wide range of information sources of varying types that might only have "vague and tentative" relevance. Most importantly, it involves selection of information based on previous experience and interest (echoing the building upon and establishing networks of knowledge in Belkin's ASK model).
  2. conditioned viewing - "directed exposure, not involving active search to a more or less clearly identified area or type of information", such as browsing information sources and noting their importance to certain topics of interest because of conditioning to do so.
  3. informal search - "relatively limited and unstructured effort to obtain specific information or information for a specific purpose", such as viewing the most likely sources of relevant information and informing others as to your interest. This mode leverages the strengths of the gatekeeper concept in that if the right gatekeepers know of the information need, they can begin to provide information as they discover it.
  4. formal search - "a deliberate effort--usually following a pre-established plan, procedure, or methodology--to secure specific information or information relating to a specific issue". This would include using an Information Retrieval system to perform a comprehensive search for a particular subject, previously the most studied of these four modes of scanning.

Aguilar has shifted the focus to the information sources and how they are used. His modes point out that in any real-world setting, resource scarcity inhibits pursuing all scanning with formal searches. A variety of approaches is most efficient, with continually shifting information needs as the underlying prioritizer. By utilizing all of these scanning techniques, a broad net is cast to stay informed. This point of view can help shape mature information systems to work in harmony with a user by considering the total view of information resources in the environment. Looking at not just the information found by the system, but also the information found and shared by other users, also aids in measuring how much collaboration is relied upon in Information Seeking.

This leads to views that consider the whole user experience with information. Kolodner (Kolodner 1984) approaches this from a cognitive level with her episodic view of memory. She claims that memory is organized around personal experiences or episodes, and not around abstract semantic categories. Repeated or similar memories are used to build up generalized knowledge about episodes. It is these types of broadly oriented episodes that Environmental Scanning seeks to capture. This generalized, episodic knowledge can later be used for understanding. Thus, every person's episodic memory will be different from every other person's. This view also lends itself to promoting collaborative Information Seeking in that it acknowledges that any group of users, no matter how similar in behavior, system use, or domain experience, cannot help but benefit by utilizing the collective resources of the group. This echoes the mantra of Collaborative Filtering that "no one is as smart as everyone". The application of Collaborative Filtering as an augmenting strategy for Information Seeking can therefore be quite useful.

Future directions in Information Seeking are therefore amplification and augmentation; leveraging communities of users with different perspectives and interdisciplinary backgrounds; and the use of scenarios involving knowledge bases or agents. My goal is to improve Information Seeking using a combination of all three: augmenting scanning by pooling groups of users via a knowledge base-like set of resources.

2.3 Collaborative Information Seeking

Understanding groups of users working together in electronic environments is the focus of Computer Supported Cooperative Work (CSCW). CSCW is a relatively new area of study that focuses on ways to improve groupwork with information systems. Due to the nature of my research, it is important to note some relevant ideas and studies developed in the CSCW community. Information Seeking systems that utilize CSCW ideas can be more useful in a networked environment.

We know from environmental scanning studies that one of the most effective channels for the dissemination of information and expertise within an organization is its informal network of collaborators, colleagues, and friends. Social networks can be as important as official organizations. The advantage of this kind of network is a tradeoff: you can spend time scanning for information yourself or use that time asking others. CSCW studies try to find systems that help provide the right balance.

At the same time, CSCW applications help to uncover and take advantage of existing social networks, much like the bibliometric technique of co-citation analysis, where following a reference in a document can lead to another work, leading to another reference, and so on, establishing a network of related information on a topic. CSCW studies try to augment the strengths of social networks so that richer relationships between communities of people are easier to establish and maintain.

Even though computer networks are the primary means of collaboration today, as long as some kind of communication channel is provided, a group can access its collected intelligence. An example of this is the Livewire experiment that actually underscores the collaborative process by not using convenient computer networking. A single diskette of information was passed around a community of scholars with the encouragement to add to the data on the disk. Each participant could make her contributions and save a personal copy of all of the information for herself. As the diskette made the rounds of participants, the information on the disk increased, as did the progress of the group toward building a body of useful work. They found that "a large part of a group process is information sharing. While networked computers usually allow people within an institution to share information across common data files, networks are rarely available for loosely-coupled social groups" (Witten et al. 1991). The Livewire project illustrates the power of computer-supported information sharing. When even a simple information sharing system like this succeeds in both augmenting each user's information base and helping to establish a workgroup community, it is apparent that the added efficiency of computer networking can improve both.

A more general view of this phenomenon is offered in Derrick de Kerckhove's 1997 book Connected Intelligence: The Arrival of the Web Society. He argues that the main advantage we have gained from computers in working together, the one that has made all the difference, is asynchronicity: the ability to work in multiple, virtual, and mobile groups. de Kerckhove shows a number of implications possible with collaborative Information Seeking:

  • Working with others to meet information needs will decrease the need for specialization as it will not be as important to know everything about a subject, just who to ask.
  • The obvious benefits of electronic collaboration may eventually actually increase human interactions, both personal and commercial.
  • The information processing advantages of information systems can speed the gradual self-organization of information sources, increasing the efficiency of tools for Information Seeking.
  • As these benefits multiply, they will gradually change norms about relationships and technology.

While some of these trends may not be totally beneficial, promoting (under some circumstances) such problems as groupthink, oversimplification, or inter-dependence on others, these trends are still the reasons why collaboration is powerful. Groupthink-like problems actually increase participation and an individual's satisfaction with a decision; oversimplification can at least provide more understandable starting points for exploration; and inter-dependence on others for pooling resources and knowledge is acceptable provided that technology is there for advanced support.

de Kerckhove points out that the law of network connectedness is that a simple linear curve of connections leads to an exponential multiplication of interconnections. In turn, this multiplication of interconnections gives rise to emergent markets made of new configurations within a network. One example is fax machines: as more machines are used, the overall value of each machine increases. Economist Brian Arthur calls this the law of increasing returns (Arthur 1988), just one of the new principles of a networked economy.
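A small calculation shows how quickly interconnections can outpace connections. The sketch below assumes one simple counting rule that the source does not specify: every pair of participants (fax machines, for instance) can form exactly one link, giving n(n-1)/2 possible links for n participants. The function name possible_links is hypothetical.

```python
def possible_links(nodes: int) -> int:
    """Number of distinct pairwise connections among `nodes` participants."""
    return nodes * (nodes - 1) // 2


# Double the number of participants at each step and count the links.
growth = [(n, possible_links(n)) for n in (2, 4, 8, 16)]
```

Under this pairwise rule the growth is quadratic, which is much faster than linear though not exponential in the strict mathematical sense; other counting rules, such as counting every possible subgroup, would give steeper curves.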

Nardi and Miller's 1991 ethnographic study of users of spreadsheet applications, long considered a good example of a well designed "single-user" application, shows how inevitable computer technology is making collaboration in meeting information needs. Even though spreadsheets were designed to be stand-alone applications, they found that people used each other's help in preparing spreadsheets, with informal workarounds that made the seemingly single-user spreadsheet application a groupwork tool. Spreadsheet co-development was the rule, not the exception, in the environments they studied. There was cooperation in sharing program expertise; transferring domain knowledge; debugging spreadsheet formulas; training; and face-to-face work in meetings (Nardi and Miller 1991). The point is that people cannot help but organize themselves and their work so that complex problems can be solved collectively.

2.3.1 Capitalizing on Information Seeking Research

Software designers are taking all of these findings more readily into their design decisions. Volker Wulf contends that these findings have changed our models of an organization. This kind of self-organization "marks a fundamental turn from a mechanistic perception of an organization towards an organic understanding." He goes on to state that "today, more attention is paid to the social processes of group and norm formation and to the development of social networks" (Wulf 1996). This turn towards people is primarily because the software design community's previous attempts, such as Artificial Intelligence and Knowledge Engineering, did not prove good enough; now we are designing systems to leverage the preexisting intelligence of individuals and using software to coordinate the process.

In terms of Information Seeking and Information Retrieval, Ehrlich and Cash point out that in information-intensive jobs like customer support work, system users rely heavily on each other's recorded knowledge to diagnose and solve customer problems (Ehrlich and Cash 1994). Studies like this helped to shape the development of Lotus Notes, where Information Filtering and Retrieval are highly collaborative. Notes workgroups routinely include at least one person (what we might now call a gatekeeper) who is especially skilled at browsing through large amounts of often diverse information and then making his or her findings available to others, generally with commentary on their relevance and/or intrinsic value (Ehrlich 1995). Lotus Notes users can annotate their own and others' documents and send them to others. Document summaries can be made that work as a custom table of contents. These lists can be used by others to build their own tables of contents or as starting points for Information Seeking.

Building on systems like Lotus Notes, in 1991, Cook and Birch (Cook and Birch 1991) developed guidelines for a distributed system that would enhance convenient and effective communication; give information about each other's status and whereabouts; and support people in planning and execution of various kinds of office tasks. Their list included:

  • shareable objects in a distributed address space
  • views, icons, and direct manipulation
  • WYSIWIS (What You See is What I See) via common views
  • private and shareable spaces
  • visual editing views
  • activity models
  • object naming and behavior
  • integration of multiple media
  • ownership and authentication

These guidelines describe almost exactly the browser interface developed to find and browse information on the internet -- the World Wide Web browser.

2.4 Bibliometrics

As information on the Web increases toward entropy, we need to apply theory from other disciplines to develop new methods, modeling techniques and metaphors to examine this emerging complex network. One such set of techniques from Information Studies is Bibliometrics. Diodato defines Bibliometrics:

"a field that uses mathematical and statistical techniques, from counting to calculus, to study publishing and communication patterns in the distribution of information" (Diodato 1994).

The operational methods of Bibliometrics can be useful to produce quantitative data and subsequent analysis to understand and improve information systems use.

2.4.1 Bibliometrics Overview

Bibliometrics stems from the idea that distribution and use of information has patterns that can be analyzed by counting and analyzing citations, finding relationships between these references based on frequency, and using other statistical formulas. Common sources of bibliometric data are the Science Citation Index or Social Science Citation Index, where an author's publication influence can be seen by how often she is referenced.

Classic Bibliometric analysis begins by counting all types of citations with little (if any) weight given to reference types. A refinement of mass citation counting is direct citation counting, where the quantity of citations is tracked over a given period of time to test for aspects of an author's or article's impact. The standard formula for impact is:

(number of journal citations)/(number of citable articles published)

While somewhat blunt, applying and averaging citations along certain scales begins to become even more useful. Another basic bibliometric technique is calculating an immediacy index of influence using this formula:

(number of citations received by article during the year)/(total number of citable articles published)
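
As a minimal sketch, the impact ratio and the immediacy index defined above each reduce to a single division. The function names and the journal figures below are hypothetical and serve only to illustrate the two formulas.

```python
def impact_factor(journal_citations: int, citable_articles: int) -> float:
    """(number of journal citations) / (number of citable articles published)."""
    if citable_articles <= 0:
        raise ValueError("citable_articles must be positive")
    return journal_citations / citable_articles


def immediacy_index(citations_this_year: int, citable_articles: int) -> float:
    """(citations received during the year) / (total citable articles published)."""
    if citable_articles <= 0:
        raise ValueError("citable_articles must be positive")
    return citations_this_year / citable_articles


# Hypothetical journal: 120 citable articles, 300 citations overall,
# 30 of which arrived in the year of publication.
impact = impact_factor(300, 120)
immediacy = immediacy_index(30, 120)
```

With these invented figures the impact ratio is 2.5 and the immediacy index is 0.25; the numbers only show the scale of each measure and say nothing about any real journal.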

The immediacy index is also a useful metric for gaining a broader view of an article's impact, but it is not wholly objective. Some journals have longer publication histories, are considered more prestigious (making publication competition more intense), or may require many references (historical listings of past articles, for example) -- all of which can skew immediacy data.

Bibliometric Coupling

Kessler suggests a technique known as bibliographic coupling: measuring the number of references two papers have in common to test for similarity. He then showed that a clustering based on this measure yields meaningful groupings of papers for research (and information retrieval), finding that "a number of papers bear a meaningful relation to each other when they have one or more references in common" (Kessler 1963).

Kessler also found a high correlation between groups formed by coupling and groups formed by subject indexing. As Bibliometrics has become more automated, many have tried to take these techniques and engineer software to detect patterns and establish relationships between articles (Price 1968).
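
A small sketch may make the coupling measure concrete. It counts the references that each pair of papers has in common, which is the quantity Kessler describes; the paper and reference identifiers are invented, and the function names are hypothetical rather than taken from any of the cited systems.

```python
from itertools import combinations


def coupling_strength(refs_a, refs_b):
    """Number of references two papers have in common."""
    return len(set(refs_a) & set(refs_b))


def coupling_table(papers):
    """Map each pair of papers sharing at least one reference to its strength.

    `papers` maps a paper identifier to the list of references it cites.
    """
    table = {}
    for a, b in combinations(sorted(papers), 2):
        shared = coupling_strength(papers[a], papers[b])
        if shared:
            table[(a, b)] = shared
    return table


# Invented data: P1 and P2 share references R2 and R3; P3 shares none.
papers = {
    "P1": ["R1", "R2", "R3"],
    "P2": ["R2", "R3", "R4"],
    "P3": ["R5"],
}
couplings = coupling_table(papers)
```

Here only the pair (P1, P2) is coupled, with a strength of 2. Grouping papers by such pairwise strengths is one simple route to the clusters described above.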

The first of these studies, by Schiminovich and others (Schiminovich 1971), took the first real steps towards applying Bibliometrics to electronic publications (notably, physics literature stored on a time-sharing system at MIT) to begin automatically unearthing these types of patterns. Moreover, once data collection and input issues are overcome, information can be measured in its raw form on a computer far quicker than by hand.

Cocitation Analysis

Marshakova (Marshakova 1973) and Small (Small 1973) independently developed coupling further by noting that if two references are cited together in a publication, the two references are themselves related. The greater the number of times they are cited together, the greater their cocitation strength. The major difference between bibliographic coupling and cocitation is that while coupling measures the relationship between source documents, cocitation measures the relationship between cited documents. This implies that an author purposefully chose to relate two articles, rather than merely the association between two articles that coupling reveals.
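
Cocitation strength can be sketched in the same spirit. The example below assumes each source document is represented simply by the list of references it cites and counts how often each pair of references appears together; the data and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count how many source documents cite each pair of references together."""
    counts = Counter()
    for refs in reference_lists:
        # Each unordered pair is counted once per citing document.
        for pair in combinations(sorted(set(refs)), 2):
            counts[pair] += 1
    return counts


# Invented reference lists of three citing documents.
reference_lists = [
    ["A", "B", "C"],
    ["A", "B"],
    ["B", "C"],
]
strengths = cocitation_counts(reference_lists)
```

In this invented data, A and B are cited together twice, as are B and C, while A and C are cited together only once. Note that the counting runs over cited documents, whereas coupling runs over the citing ones.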

To tune these techniques to finer levels involves asking a series of questions about the reference:

  1. is the reference conceptual or operational?
  2. is the reference organic or perfunctory?
  3. is the reference evolutionary or juxtapositional?
  4. is the reference confirmative or negational? (Moravcsik 1975)

These questions reveal granularity and trends in measurement such as trends to refute (or present an alternative to) an article or the growing importance of a mode of thought which shows the maturity level of an idea as it evolves through the literature.

2.5 Bibliometric Laws

These bibliometric laws are the most widely known and used. They are all related, and even derived from each other. Like physical laws, they seek to describe the working of a system by mathematical means. They are incredibly useful in developing general theories about information and provide data to study further. They all have potential for use in Web measurement and Collaborative Filtering applications.

The three primary laws of Bibliometrics are Zipf's Law and its derivatives, Lotka's Law and Bradford's Law of Scattering. Each of these is examined in this section, as well as some corollaries and supporting ideas.

2.5.1 Zipf's Law

The most powerful, wide-ranging law of Bibliometrics is Zipf's Law.[4]

Zipf's Law essentially predicts the phenomenon that as we write, we use familiar words with high frequency. Specifically, applied to word frequency in a text, the distribution states that the nth-ranked word will appear k/n times, where k is a constant for that text.

For analysis, this can be applied by counting all of the words in a document (minus some words in a stop list which are most-likely in the stop list due to Zipf's law) with the most frequent occurrences representing the subject matter of the document.[5]

We could also use relative frequency (more often than expected) instead of absolute frequency to determine when a new word is entering a vocabulary.
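
The counting procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the stop list is a tiny hypothetical one, words are split with a simple regular expression, and the constant k is taken to be the frequency of the top-ranked word.

```python
import re
from collections import Counter

# Tiny hypothetical stop list; a real one would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def rank_frequency(text, stop_words=STOP_WORDS):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in stop_words)
    # Sort by descending frequency, breaking ties alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq) for rank, (word, freq) in enumerate(ranked, start=1)]


def zipf_expected(k, rank):
    """Frequency Zipf's Law predicts for the word at a given rank: k/n."""
    return k / rank


sample = "the web is a web of documents and the documents link to the web"
table = rank_frequency(sample)
k = table[0][2]  # assumption: use the top word's frequency as the constant k
predictions = [zipf_expected(k, rank) for rank, _, _ in table]
```

For this toy sentence the observed frequencies are 3, 2, and 1 against predictions of 3, 1.5, and 1. A text this short cannot confirm the law; the sketch only shows how observed and predicted values would be compared.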

Zipf said his law (called the Principle of Least Effort) is "the primary principle that governs our entire individual and collective behavior of all sorts, including the behavior of our language and preconceptions" (Zipf 1949). Zipf is saying that the main predictor of human behavior is striving to minimize effort. Therefore, Zipf's work applies to almost any field where human production is involved.

Zipf and the Web

Huberman (Huberman et al. 1997) points out in an empirical study that there are Zipf-like distributions in path lengths and page visits to sites on the World Wide Web. In other words, the more links a user must follow to reach a Web page, the fewer visits that page receives.

Regarding publications (and therefore the Web), Zipf points out "that individuals will at all times try to minimize effort, then it follows that the reason for their buying and reading newspapers is that such conduct is an economical method of learning of those events in their environment that may be of positive or negative value for their particular economies... in order to lure these potential buyers into the paper's reader population for the sake of increasing the circulation, the editor must increase the diversity of his news items" (Zipf 1949). This can apply to more than just newspapers; it could be any information source. The World Wide Web is truly diversified and not limited by page size (as Zipf was concerned with for print), but financial resources limit the amount any one organization can publish. However, as World Wide Web publishing gets easier and more people have access to the Web, will this formula still hold? As mentioned earlier, many different models have proven that they can scale up to the Web's size. In this case, the economics of learning would not be primarily focused on money, but on time spent reading or finding useful information on the Web. Essentially, I argue that Zipf's Law then has an even more subtle credibility as we begin to focus on a user's cognitive effort, not just economic effort.

2.5.2 Lotka's Law

Lotka's law states that in "a well defined subject field over a given period of time ... a few authors are prolific and account for a relatively large percent of the publications in the field" (Diodato 1994). Generally, Lotka's Law is an inverse square law: for every 100 authors contributing one article, 25 will contribute 2, 11 will contribute 3, and 6 will contribute 4 each. We see a general decrease in productivity among a body of authors following the ratio 1:n^2. This ratio shows that some produce much more than the average, which seems agreeably true for all kinds of content creation. However, Lotka does not take impact into account, only production numbers. Furthermore, in 1974, Voos found that in Information Science the ratio was 1:n^3.5 (Voos 1974). Thus, we can say that the exponent in Lotka's Law may not be constant, but the inverse power relationship holds. Our challenge will then be to find the correct ratio in different environments and fields.
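
The arithmetic behind these figures can be sketched in a few lines. The function name is hypothetical; the exponent defaults to the inverse square case and can be changed to explore other ratios, such as the 3.5 reported by Voos.

```python
def lotka_expected_authors(single_article_authors, n_articles, exponent=2.0):
    """Expected number of authors contributing n articles under 1:n^exponent."""
    return single_article_authors / (n_articles ** exponent)


# Inverse square case, starting from 100 authors with one article each.
expected = [round(lotka_expected_authors(100, n)) for n in (1, 2, 3, 4)]

# The same starting point with the steeper exponent found by Voos.
steeper = lotka_expected_authors(100, 2, exponent=3.5)
```

With the default exponent the rounded values are 100, 25, 11, and 6, matching the figures above. With an exponent of 3.5, only about 9 of every 100 single-article authors would be expected to contribute two articles.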

2.5.3 Bradford's Law

Bradford revealed a pattern of how literature in a subject field is distributed in journals. "If scientific journals are arranged in order of decreasing productivity of articles on a given subject, they may be divided into a nucleus of periodicals more particularly devoted to the subject and several other groups of zones containing the same number of articles as the nucleus" (Drott 1981). This essentially indicates that publications have a "core and scatter" phenomenon where a few core journals are prolific in publishing articles while other journals publish progressively fewer articles.

These core journals occupy the "core Bradford zone" with two following zones (the "scatter zones") of much less influence as shown by the number of journals required to equate with the core total. For example, Bradford in (Diodato 1994):

  1. the top 8 journals produce 110 articles;
  2. the next 29 journals produce 133 articles;
  3. the next 127 journals produce 152 articles.

This shows that the numbers of journals in the three groups, each producing a nearly equal number of articles, are roughly in the proportion 1:4:16, or 4^0:4^1:4^2. This makes n=4 in the general proportion 1:n:n^2, where n is called the Bradford multiplier.

Bradford's Law therefore makes it possible to estimate how many of the most productive sources would yield any specified distribution of the total number of items.
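
As a rough sketch, the Bradford multiplier can be estimated from observed zone sizes and then used to predict how many journals each zone should contain. Taking the geometric mean of the successive zone ratios is my own simplifying assumption, not a method given in the sources, and the function names are hypothetical.

```python
def bradford_multiplier(zone_journal_counts):
    """Estimate n in 1:n:n^2 as the geometric mean of successive zone ratios."""
    if len(zone_journal_counts) < 2:
        raise ValueError("need at least two zones")
    first, last = zone_journal_counts[0], zone_journal_counts[-1]
    steps = len(zone_journal_counts) - 1
    # The product of the successive ratios equals last / first.
    return (last / first) ** (1 / steps)


def predicted_zones(core_size, multiplier, zones=3):
    """Journals expected in each zone: core, core*n, core*n^2, ..."""
    return [core_size * multiplier ** i for i in range(zones)]


# Zone sizes from the example above: 8, 29, and 127 journals.
n = bradford_multiplier([8, 29, 127])
ideal = predicted_zones(8, 4)
```

The estimate for the example data is just under 4, and an exact multiplier of 4 would predict zones of 8, 32, and 128 journals, close to the observed 8, 29, and 127.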

Bradford's Law can also be used to measure the rate of obsolescence, over time, by distinguishing between the usage of the levels of items. Essentially, this is a method of clustering. For example, a collection of journal articles in 9 journals may account for 429 articles, the next 59 journals may account for 499 articles, while the last 258 journals can account for 404. We roughly get three groupings (ranging from 404 to 499) of articles. Bradford noticed this consistent number of titles it takes to contribute to each third of the total population of articles.

Bradford discovered this regularity by calculating the number of titles in each of the three groups: 9 titles, 9x5 titles, 9x5x5 titles. Drott suggests that we can apply this widely, as long as we account for sample sizes, area of (journal) specialization and journal policies (Drott 1981).

Bradford and the Web

B.C. Brookes takes Bradford one step further, pointing out that he is correct if we have a finite (manageable, relevant) number of journals. Editorial selection and publishing costs currently determine much of the structure and volume of content of most publications. Therefore, Brookes' point may be more relevant when analyzing the expanding, multi-relational World Wide Web. Will these ratios hold, change or not apply at all? With a deluge of information, we may find the limits to this law. Again, Brookes implies this by stating that "index terms assigned to documents also follow a Bradford distribution because those terms most frequently assigned become less and less specific... (but)... increasingly ineffective in retrieval" (Brookes 1973). Essentially, volume and homogeneity can work against us--the more a term is used, the more it becomes generic and less useful in locating specific information.

More positively, we may discover that there is a "half-life" during which Bradford's Law applies. As time passes, we can observe a fairly regular increased use of a term or citation, followed by a relatively steep and permanent reduction later. We can examine this phenomenon over time, to see that citations originally counted year by year can be expressed as the geometric sequence:

R, Ra, Ra^2, Ra^3, Ra^4, ..., Ra^(t-1)

where R is the presumed number of citations during the first year (some of which could not immediately be referenced in publication) and a is the constant ratio between one year's citations and the next. Because a<1, the sum of the sequence converges to the finite limit R/(1-a). After a fairly predictable time (for each area of study) usage drops off. This type of measurement has utility when examining information on the World Wide Web. As the Web by nature publishes information prolifically, and often about technical information or general news (which, again, changes rapidly), we see how thinking about what constitutes a "Web year" of time might impact Web documents. Most Web users agree that information use evolves much faster on the Web: if a Web site or page is useful, its use will be further amplified by links to it, by other sources commenting on it, or by the creation of other Web sites similar to it. Sites that are not read much (if the author is aware of the site's readership) seem to naturally never get updated or eventually lose their place in the attention of the Web. Indexes no longer point to them, links to their pages are not updated or eventually atrophy off the list, and interest shifts to a more novel coverage of the topic or another topic altogether.
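
The geometric citation sequence given above can be checked numerically with a short sketch. The function names and the sample values of R and a are hypothetical; the only facts assumed are the sequence itself and its limit R/(1-a).

```python
def citations_in_year(R, a, year):
    """Citations expected in a given year (year 1 is the first): R * a^(year-1)."""
    return R * a ** (year - 1)


def cumulative_citations(R, a, years):
    """Total citations accumulated over the first `years` years."""
    return sum(citations_in_year(R, a, y) for y in range(1, years + 1))


def citation_limit(R, a):
    """Finite limit of the series, R / (1 - a), valid only for 0 <= a < 1."""
    if not 0 <= a < 1:
        raise ValueError("a must satisfy 0 <= a < 1 for the series to converge")
    return R / (1 - a)


# Hypothetical document: 100 citations in year one, halving every year.
third_year = citations_in_year(100, 0.5, 3)
first_four_years = cumulative_citations(100, 0.5, 4)
eventual_total = citation_limit(100, 0.5)
```

With these invented values the document receives 25 citations in its third year, 187.5 over its first four years, and never more than 200 in total. A smaller value of a would model the faster obsolescence suggested for Web documents.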

2.6 Bibliometrics and the World Wide Web

Bibliometrics are directly applicable to analyzing the World Wide Web, the largest network of documents in history. We can use Bibliometric techniques, rules and formulas in a "pure" environment - information about information organized on information systems.

Many applications of Bibliometric concepts can be used to deduce how to manage Web resources by measuring resource access and to test the utility of a document (assuming that a visit or return visit implies the document was useful). The most-used documents can be enhanced or used as a starting point for other topic selections. Even caching can be (and is) helped with bibliometric-oriented concepts: keeping the most frequently accessed Web pages available in the cache for the quickest access is good optimization of Web server resources. However, this in turn might cause cached files to be requested more often simply because the reliability of being received by a Web browser is higher. Essentially, this is Price's "cumulative advantage model" where simply put: success breeds success. However, it also implies that an obsolescence factor is at work, in that information that is not used will atrophy on the Web (Price 1976).

Now that Bibliometrics have been overviewed, we can move into how they would be specifically applied to the World Wide Web.

2.6.1 World Wide Web Surveys

Web surveys are another way to gather information usage data. Since the Web has existed, there have been numerous studies to determine both usage and user demographics. Most of these efforts are similar to bibliometric techniques and reveal similar information. These surveys are looking for most-referenced Web pages, the most prolific or important Web sites, and what new words can be used to classify Web sites and documents. By examining these surveys, we can devise methods to refine or modify general Bibliometrics to find out who users are and what they are doing. There are essentially two kinds of surveys possible: questioning Web users and sampling Web documents.

The Georgia Tech Graphics, Visualization, and Usability Web Surveys are the cornerstone of Web user demographics. Through clever use of programming, data gathering, and extensive statistical analysis we are finally developing an outline of web users and their preferences. By using HTML forms, qualitative information can be collected in addition to the large amounts of quantitative data. Techniques such as oversampling or random sampling (of users who completed surveys) are also applied to remove as much ambiguity as possible (Kehoe 1996). Bibliometrically, these surveys are looking for clustering of various kinds to determine trends in Web usage. Surveys like these could be extended to actually gather user-revealed Web metrics by observing how users actually react and use specific Web documents, not by asking them. As these techniques are refined, they can be automated in an effort towards adaptable Web interfaces.

Web search engines essentially are continually polling Web documents for content. However, they can also be used to understand the characteristics of Web documents. Just as different publishing guidelines, norms, and characteristics of their formats influence print Bibliometric data, Web documents must have influences too. In a comprehensive analysis of over 2.6 million Web documents, the Inktomi development team (which is now the Hotbot search engine) examined:

  • document size
  • HTML tag usage
  • ratio of tags to overall document size
  • tag attribute usage
  • tag syntax errors
  • browser-specific extension usage
  • protocols used in child URLs
  • file types used in child URLs
  • number of in-links (links that point to other locations on the page itself)
  • readability
  • Web server port usage (Woodruff 1996).
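
A few of the characteristics listed above can be measured with very little code. The sketch below is not the Inktomi team's method; it is a hypothetical illustration that uses Python's standard HTML parser to collect document size, tag usage, a tag-to-size ratio, attribute usage, and links that point to locations on the page itself. The class and field names are my own.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Collect simple usage counts while parsing an HTML document."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.attributes = Counter()
        self.same_page_links = 0

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1
        for name, value in attrs:
            self.attributes[name] += 1
            # A link whose target starts with "#" stays on the same page.
            if tag == "a" and name == "href" and value and value.startswith("#"):
                self.same_page_links += 1


def fingerprint(html):
    """Return a small dictionary of measurable document characteristics."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total_tags = sum(parser.tags.values())
    return {
        "size_chars": len(html),
        "tag_counts": dict(parser.tags),
        "tags_per_100_chars": 100 * total_tags / len(html) if html else 0.0,
        "attribute_counts": dict(parser.attributes),
        "same_page_links": parser.same_page_links,
    }


page = '<html><body><h1 id="top">Title</h1><p>Text</p><a href="#top">Up</a></body></html>'
fp = fingerprint(page)
```

Measures such as readability, syntax errors, or server port usage would need additional tools and are left out of this sketch.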

Primarily, the team aggregated the data to create a number of ranked lists such as the ten most-used tags and the ten most common HTML errors. However, what this data begins to reveal is that with this many measurable characteristics in a document, it might be possible to identify a "document fingerprint". This type of identification could then be used to compare and identify Web browser users' behavior regarding a particular type of Web document. Trigg defines a comprehensive set of link types that could be used in online environments, but as yet they are not in use (Trigg 1983).

Gathering Bibliometric Data from Web servers

Many of these ideas are used in Downie's attempt to apply Bibliometrics to the Web (Downie 1996). Mainly, attempts were made to analyze the following categories:

  1. User-based analyses to discover more about user demographics and unveil preferences based on:
    • who (organization)
    • where (location)
    • what (client browser).
  2. Request analyses for content.
  3. Byte-based analyses to measure raw throughput by certain timeframes (bytes served per day, etc.).

The power of these techniques is that they can be merged to develop a detailed scenario of a user's visit(s) to the Web site and their preferences, problems and actions. When using bibliometric analysis techniques, Downie discovered via a rank-frequency table of Web pages accessed on his server that requests conformed to a Zipfian distribution.
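
A rank-frequency table of the kind Downie describes can be sketched from raw server log lines. The example assumes entries in the Common Log Format and uses invented log lines; the pattern and function names are hypothetical, and nothing here reproduces Downie's own procedure. Under a Zipfian distribution the product of rank and hits should stay roughly constant, so that product is included in each row.

```python
import re
from collections import Counter

# Assumption: the request appears in quotes as "METHOD /path PROTOCOL".
REQUEST_PATTERN = re.compile(r'"(?:GET|HEAD|POST) (\S+)')


def page_rank_frequency(log_lines):
    """Return (rank, path, hits, rank * hits) rows, most requested page first."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, path, hits, rank * hits)
        for rank, (path, hits) in enumerate(ranked, start=1)
    ]


# Invented log entries for illustration only.
log_lines = [
    'host1 - - [01/Jan/1997:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 1043',
    'host2 - - [01/Jan/1997:10:00:05 -0500] "GET /index.html HTTP/1.0" 200 1043',
    'host1 - - [01/Jan/1997:10:01:00 -0500] "GET /papers.html HTTP/1.0" 200 5120',
    'host3 - - [01/Jan/1997:10:02:00 -0500] "GET /index.html HTTP/1.0" 200 1043',
    'host3 - - [01/Jan/1997:10:03:00 -0500] "GET /papers.html HTTP/1.0" 200 5120',
    'host2 - - [01/Jan/1997:10:04:00 -0500] "GET /contact.html HTTP/1.0" 200 312',
]
table = page_rank_frequency(log_lines)
```

Six invented requests are far too few to test for a Zipfian distribution; the sketch only shows how the table would be built from a full log file.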

Other results confirm that poor Web server configuration and lack of access to, or use of, full log files can hinder further results. It is also worth noting that Downie attends to ethical observation issues that many Webmasters and information system professionals do not normally consider.

Gathering Bibliometric Data from External Web Sources

It is also possible to gather information about a Web site by finding information external to the site from various sources on the Internet. If a site is developed using the above methods, more site-external data can be collected. An obvious example is to use a search engine to find references to the Web site in general or to specific URLs (Uniform Resource Locators). This process is made simpler since page and directory names are somewhat unique. However, for a true crosscut, use different search engines to gather a variety of data across time and geographic regions. Each search engine has its own update times as well as techniques to filter redundant and similar information; use that as an advantage in gathering data.

Using programs much like search engines to follow all of the links in any referenced sites could also reveal less obvious links or references. A reverse parsing of URLs (focusing on the "end" of the URL's host name, its organization type: "com" for business, "org" for organization, or "edu" for educational institution) could reveal bibliometric data about what kinds of Web sites focus on certain types of content or change Web pages in a timely fashion.
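
The reverse parsing of URLs described above can be sketched with Python's standard URL parser. The mapping covers only the three organization types named in the text; every other suffix is lumped into a hypothetical "other" category, and the example URLs are invented.

```python
from collections import Counter
from urllib.parse import urlparse

# Only the suffixes named in the text; anything else counts as "other".
ORGANIZATION_TYPES = {
    "com": "business",
    "org": "organization",
    "edu": "educational institution",
}


def organization_type(url):
    """Classify a URL by the last label of its host name."""
    host = urlparse(url).hostname or ""
    suffix = host.rsplit(".", 1)[-1].lower()
    return ORGANIZATION_TYPES.get(suffix, "other")


def tally_organization_types(urls):
    """Count how many URLs fall into each organization type."""
    return Counter(organization_type(u) for u in urls)


# Invented URLs standing in for links gathered from referencing pages.
urls = [
    "http://www.example.com/a.html",
    "http://www.example.edu/~user/",
    "http://example.org/",
    "http://www.example.net/",
]
tally = tally_organization_types(urls)
```

Combining such a tally with page topics or modification dates would be one way to ask which kinds of sites favor which content, though that step is beyond this sketch.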

For fine-grained analysis, services such as DejaNews can show postings and access statistics on certain Web usage data. Finally, utilizing a customized robot or spider to search internal and external Web sites for commonalities and references can also yield new data.

3. The Internet and World Wide Web

This section introduces the Internet and World Wide Web, some of their basic standards and functionalities. Also included are descriptions and reviews of the data sources and measurement methods currently available to understand Web usage activity.

The Internet is the most well-known networked community, with millions of users. The most-used interface to access information on this global network is the World Wide Web browser. Note that browsers, like the spreadsheets in Nardi and Miller's study, were also designed to be single-user applications, albeit mainly to move through others' documents (Berners-Lee 1992).

Tim Berners-Lee, creator of the World Wide Web (the Web), used the element of hypertext to provide the architectural metaphor. Via a simple set of protocols, an easy-to-use document description language and a graphical browser, he managed to build a foundation for information access that has revolutionized computing (Berners-Lee 1994).

Note again that Ellis points out that Agosti considered that hypertext and Information Retrieval share a common underlying structure, which the Web embodies (Ellis 1996). This helps to strengthen my conclusion that studying Information Seeking provides a framework for understanding World Wide Web use.

3.1 World Wide Web Foundations

The World Wide Web is essentially a set of protocols and standards built on top of the Internet. These standards are maintained and improved by the World Wide Web Consortium and include these features:

Hypertext Transfer Protocol (HTTP) is the set of actions that transfer information between Web servers (that both serve information and track usage) and Web clients (most notably graphical browsers). The primary weakness of HTTP is that it is stateless. That is, when a user interacts with a Web site, each interaction (such as downloading one of dozens of graphics or text files) is separate -- to the Web server each request could come from any combination of different clients. A typical HTTP transaction does not utilize any information about its requesting client that would preserve information about what that client has previously transacted.

Hypertext Markup Language (HTML) is the current standard tagging language for information elements in a file that is sent by Web servers and received by Web clients. Mostly used for on-screen formatting of documents, newer HTML versions are also beginning to describe information characteristics (known as meta-information) such as author, content-type, or index keywords.

Common Gateway Interface (CGI) is a set of functions and statements that enable logical operations on a Web server. CGI scripts, in any number of programming languages, coordinate the advanced features of Web servers including authentication, customization, and working with other programs. Many Web servers also support APIs (Application Programming Interfaces) that provide a framework for integrating external code with their servers.

Java is an object-oriented language that is mostly operating system-independent enabling program portability between different client machines. Recently, server-side Java is also being used to enhance Web server features beyond simple CGI applications.

JavaScript, related to Java in name only, is an easy-to-use scripting language to control basic client or server behavior.

Platform for Internet Content Selection (PICS) is an independent rating system that includes labels in Internet documents that authors or labeling services use to rate a Web page or site depending on its content. PICS can be used by Web browsers to act on these ratings to restrict or recommend a Web page or site (Miller, Resnick, and Singer 1996). Related to our concerns, these ratings can be used as data for Information Seeking, Information Retrieval, or Collaborative Filtering applications.

Resource Description Framework (RDF) is a proposal by the World Wide Web Consortium as a technical foundation for providing interoperability between applications that exchange machine-understandable information on the Web. RDF can enhance search engine resource discovery; enable software agents to both exchange and share information (like PICS) in content rating; and describe collections of Web pages as a single "document" for copyright and description rights.

Platform for Privacy Preferences (P3P) is a draft set of technical aspects of Web privacy concerns that lets Web sites set preferences about their privacy practices and allow Web users to control the release and use of their data. For example, a set of default preferences enables the Web user to allow her user information to be used only for enhancing her interaction with the Web site. P3P coordinates this agreement between the user and the Web site.

Open Profiling Standard (OPS) is one of the areas within the P3P preferences that describes the standards for Web user profile information for transfer between Web users, Web sites, and Collaborative Filtering applications. Individual privacy is stressed, allowing users control over their profile. For example, a user's profile may tell a Web site that she does not like frames or wants her information presented in Dutch, if possible. A Web site providing collaborative information would automatically know (if allowed to) what types of products or services to present to a Web user.

The combination of these openly published standards enables basic individual and group interaction on the World Wide Web. With the growing universality of the Internet, the World Wide Web is proving to be the primary platform for information seeking and groupwork. The simple functionality of these standards makes developing applications within their limitations both rapid and resource efficient. With the number of developers working on Internet applications, the pool of new tools, code libraries and interfaces is growing more rapidly than on any past computing platform.

Indeed, it is as if the Internet has a manifest destiny to be the communication medium of computers in the future. Rice et al. echo this by proposing the use of the open protocols of the World Wide Web as the entire interface for the user. In their paper, they describe how to:

"deliver a sophisticated, yet intuitive, interactive application over the web using off-the-shelf web browsers as the interaction medium. This attracts a large user community, improves the rate of user acceptance, and avoids many of the pitfalls of software distribution." (Rice et al. 1996)

While they admit that Web systems place a new set of constraints on user interface design, they provide work-arounds to address many common difficulties. As these difficulties are overcome, they believe that the expanding use of the Web will guarantee that software delivery over the Web will become more common.

My research will utilize these standards as much as possible to develop suites of applications to augment information seeking via improved collection and use of Web browser data.

3.2 Information Seeking on the World Wide Web

The ubiquity of the World Wide Web lets us concentrate on the more user-centric issues of computing. Gaines points out that the

"society of distributed intelligent agents that is the Internet community at large provides an 'expert system' with a scope and scale well beyond that yet conceivable with computer-based systems alone. ... In developing new support tools one asks "what is the starting point for the person seeking information, the existing information that is the basis for their search?" A support tool is then one that takes that existing information and uses it to present further information that people, sites, list servers, news groups, and so on. The support system may provide links to further examples of all of these based on content, categorizations or linguistic or logical inference. The outcome of the search may be access to a document but it may also be email to a person, a list or a news group." (Gaines 1997)

His point is that with the mass of users on the Web, new tools and ideas are quickly adopted or discarded through evolutionary selection. This provides an arena unlike any other in computing, one that forces designers not only to consider but to focus on the information needs of the user. The tools available to find information on the Internet are surprisingly sophisticated, but the challenges of augmenting Information Seeking properly are so great that using the Internet as the primary information source is still difficult.

3.2.1 Organizing Information on the Web

One approach to making Information Seeking more accurate on the Web consists of the various attempts to further organize and type information. For example, Marchiori (Marchiori 1997) is pursuing "hyper" information, which takes into account the Web structure that the (information) object is part of. Criteria like a Web site's visibility (ease of viewing the information), accessibility (who hosts the site and how much traffic they can handle), identifiability (how easy it is to discover what the information is about), and referencability (what is linked from and to the site) tell us a lot about a site's content. These subtler measures are good criteria to begin considering when gathering information about a Web site's content.

Ellen Spertus is working on this problem by taking advantage of the Web's naming conventions. She shows that the varieties of Web architectures, links, and categorization can be used to find an individual or specific homepage, new locations of moved pages, and un-indexed pages. She bases these ideas on a set of heuristics about Web pages including:

Heuristic 1 (Taxonomy Closure): Starting at a page within a hierarchical index (table of contents or Web site map), following downward (down another subdirectory on the Web server) or crosswise (to another directory at the same level in the Web server's directory structure) links leads to another page in the hierarchical index whose topic is a specialization of the original page's topic.

Heuristic 2 (Index Link): Starting at a page of links, any page reached by following a single outward link is likely to be on the same topic.

Heuristic 3 (Repeated Index Link): Starting at an index P and following an outward link to index P', a page reached through an outward link is likely to be on the same topic (or a specialization of it) as the original page P.

Heuristic 7 (Spatial Locality of References): If URLs U1 and U2 appear "near" each other on a page, they are likely to have similar topics or some other shared characteristic. Nearness can be defined as proximity in the structural hierarchy of the page.(Spertus 1997)

The various Web search engines and an individual's personal pages of bookmarks also contribute to this kind of recommended reference information, but they are passive, reactive descriptions of Web content. We need more proactive systems to provide help with dynamic searching and organizing the ever-expanding sea of Web pages.

2.4 Leveraging Information Seeking on the Web

One such application is the Remembrance Agent project from the Agents group at the MIT Media Lab. The Remembrance Agent is a program that augments human memory by displaying a list of documents which might be relevant to the user's current context. It compiles both personalized indexes and group listings (with permission) of information that a Web Information Seeker encounters. It then suggests further pages, most likely on the same Web server, that should interest the user (Rhodes and Starner 1996).

Looking for Scalable Models on the Web

Pitkow and Recker directly address the measurement of remembrance on the Web. They sought to find an algorithm for caching Web pages in memory for faster access. They found that basing the model on psychological research on human memory (specifically, the retrieval of memory items based on frequency and recency rates of past item occurrences) predicted document access with a high degree of accuracy (Pitkow and Recker 1994a). What this implies is that a single human memory model in fact scales up to address multiple users accessing a Web site. They dutifully point out that in 1994 Glassman modeled Web page access rates as a Zipf distribution that coincides with their algorithm and the concepts in the previous section of this paper.

However, as Tauscher and Greenberg point out, since the "Back" button is the most-frequently used action in any Web browser session, 58% of the pages visited by an individual are re-visits (Tauscher and Greenberg 1997). Therefore, we cannot be sure whether this type of behavior is truly Zipfian or heavily influenced by Web browser functionality. Additionally, Tauscher and Greenberg reveal that users access only a few pages frequently and browse in very small clusters of pages. However, whether this produces any kind of predictable distribution seems dependent on the total number of pages visited and the user's Information Seeking style.

The idea that Web Information Seekers only access a small number of the possible Web pages is partially a function of the growing mass of Web pages available. Gerhard Fischer and Curt Stevens echo this by trying to overcome the information overload problems associated with complex, poorly structured information spaces. While they focus on the domain of Usenet News, their conceptual framework seems to coincide with the Web as both fit their definition of (large) poorly-structured information spaces. They make explicit what experienced Information Seekers already do: develop a personalized information structure to make sense of the mass of information by forming virtual newsgroups with the topics they are interested in and modifying their usage patterns based on their history of success or failure (Fischer and Stevens 1991).

Again, they lead us to a local memory model with a proposed system based on:

"Anderson's Rational Analysis of Human Memory. His analysis leads to the definition of several effects that help measure how effectively the history of usage patterns predicts current usage patterns, or the probability that an item is needed given the history of need for such items. ... The three effects that Anderson discusses are frequency, recency, and spacing. Frequency refers to the number of times an item is needed within a specified period of time. Recency refers to the amount of time elapsed since the last need for an item, and spacing refers to the distribution across time of the exposures to these items. By holding certain factors constant in specific situations, equations (power functions) are developed that serve as predictors of the current need for an item that has been needed in the past. For example, an item that has been needed 10 times in the last month is more likely to be needed now if its past exposures are distributed evenly across the time span than if all of the past instances occurred in the first three days of that month. Similarly, if an item was needed once a long time ago the probability of current need is less than that for an item that was used very recently."

Fischer and Stevens' system, INFOSCOPE, uses agents to monitor two general areas of system usage: a user's individual preferences in daily operation of the system and the actual content of messages that are deemed interesting or uninteresting. This information can be gathered by accumulating data about an individual's Information Seeking behavior. Gradually, a profile of the user is built, which can be used to augment the Information Seeking process.
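
The frequency and recency effects in the Anderson passage quoted above can be illustrated with a small sketch. The scoring form used here (each past use contributes a power-law decay of its age) and the decay value of 0.5 are illustrative assumptions; they are not Anderson's fitted equations or INFOSCOPE's actual implementation, and the function name is hypothetical.

```python
def need_score(ages_in_days, decay=0.5):
    """Score how likely an item is needed now from the ages of its past uses.

    Each past use contributes age ** -decay, so recent uses count more
    (recency) and additional uses raise the total (frequency). The
    functional form and the decay value are illustrative assumptions.
    """
    return sum(age ** -decay for age in ages_in_days if age > 0)


# Ten uses spread evenly over the last 30 days (3, 6, ..., 30 days ago).
spread = [3 * k for k in range(1, 11)]
# Ten uses packed into the first three days of that 30-day span.
massed = [28, 28, 28, 29, 29, 29, 29, 30, 30, 30]

print(need_score(spread) > need_score(massed))   # evenly spread uses score higher
print(need_score([1]) > need_score([365]))       # a recent use outscores an old one
```

Under these assumptions the sketch reproduces the two qualitative claims in the quotation: evenly distributed exposures predict current need better than exposures massed early in the period, and a recent single use predicts need better than a distant one.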

3.2.2 Web Browsing Studies

There are few rigorous studies of Web browsing behavior despite the Web's growing popularity as the most-used Information Seeking system ever. One reason these studies are difficult is the challenge of obtaining a complete set of data describing the Web browsing sessions.

Lara Catledge and Jim Pitkow published the first major study of Web behavior in 1995, introducing what is now the standard method of collecting Web browser data. They modified the source code for a version of XMosaic, the dominant X Windows browser at the time. They configured the browser to generate a client-side log file that showed user navigation strategies and interface data. They released this modified browser to the Computer Science department students who ran Mosaic from X Terminals in the various computing labs. Their results were measured by using a task-oriented method. They determined session boundaries artificially by analyzing the time between each event for all events. The resulting mean time between user interface events was 9.3 minutes. "In order to determine session boundaries, all events that occurred over 25.5 minutes apart were delineated as a new session. This means that most statistically significant events occurred within 1-1/2 standard deviations (25.5 minutes) from the mean. Thus, a new log file was derived that indicated sessions for each user. Interestingly, a consistent third quartile was observed across all users, though we note no clear explanation for this effect."(Catledge and Pitkow 1995)
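
The session-boundary rule described above can be sketched in a few lines. A mean gap of 9.3 minutes and a cutoff of 25.5 minutes at 1.5 standard deviations imply a standard deviation of (25.5 - 9.3) / 1.5 = 10.8 minutes. The function names below are hypothetical, and the use of the population standard deviation is an assumption; the study does not say which estimator was used.

```python
from statistics import mean, pstdev


def session_timeout(gaps_minutes, k=1.5):
    """Return the mean inter-event gap plus k standard deviations."""
    return mean(gaps_minutes) + k * pstdev(gaps_minutes)


def split_sessions(timestamps_minutes, timeout):
    """Group event times into sessions; a gap above `timeout` starts a new one."""
    sessions = []
    for t in sorted(timestamps_minutes):
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)      # close enough to the previous event
        else:
            sessions.append([t])        # first event, or gap exceeds the timeout
    return sessions


# Event times in minutes since the start of logging for one user.
events = [0, 5, 12, 60, 62, 120]
print(split_sessions(events, timeout=25.5))
```

With the 25.5 minute cutoff, the six sample events fall into three sessions, because the gaps of 48 and 58 minutes exceed the timeout.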

This method is currently the only acceptable way to begin to determine session boundaries. Interestingly, the study did reveal some unexpected results. Pages users bookmarked did not match the most-popular sites visited as a whole from the group. Also, only 2% of Web pages were either saved locally or printed. Of course, these results could be due to dominant environmental factors, such as the capabilities of bookmarking in XMosaic or the availability of printers. Catledge and Pitkow also hypothesize that users categorized as "browsers" spend less time on a Web page than "searchers".

3.3 Web Browser Data Sources

Studies like that by Catledge and Pitkow call into question the collection and availability of data on Web browser usage. While many data collection methods have been used and several browser-support files are available, the data most studied at present are bookmark files and browser history files.


Many studies attempt to discover the boundaries of bookmarking behavior. Most notably, Abrams (Abrams 1997) argues that bookmarks are the personal information space of a Web user as they attempt to fight five problems of complex information spaces (the Web being one):

  1. users can prevent information overload by incrementally building a small archive with bookmarks.
  2. users can avoid pollution of their bookmarks by selecting only a few useful items and creating a known source of high value.
  3. users reduce entropy through maintenance of bookmarks and organize bookmarks only when necessary.
  4. users add structure by cost-tuning their information environment; for example, because limited time allows only a limited number of visits to any Web site, they can go directly to a particular page without having to visit the main Web site page.
  5. users compensate for lack of a global view of the Web by creating their own personal view.

These insights are all credible and show the value of a bookmarking feature in any Web browser. Abrams reports fairly that despite bookmarks' utility, retrieval/revisiting is difficult for users because bookmark names aren't always clear and any arbitrary organization structure set by the user can be ignored or forgotten over time.

Bookmarks as Information Seeking Resources

As most users simply bookmark a Web page, accepting the provided bookmark name taken from the HTML <TITLE> tag embedded in the Web page, the name is most often used to distinguish the page from other pages on the Web site, not as a retrieval cue. For example, most sites only use an overall defining title on the homepage of their site. The subsequent pages are usually titled in relation to the homepage only. Page titles that become bookmarks such as "What's New" do not provide much contextual information to a Web user who, without further investigation, will not know the particular site referred to.

Combined with the fact that most bookmark string lengths are limited, this means that Web pages with seemingly helpful titles such as "Welcome to the homepage for Information Seeking on the Web" end up saved as a bookmark titled "Welcome to the homepage for Info", which is less than helpful for any well-traveled Web user. Of course, all newer browsers support longer bookmark strings, provide a small notes field, and even allow a bookmark name to be changed locally. However, that improvement has been reduced by the many new dynamic Web site generating applications which do not generally supply good Web page titles.

However, my hypothesis is that so few users utilize these additional features, or ever organize their own bookmark collection, that bookmarks as a retrieval mechanism to aid Information Seeking are not nearly as useful as they could be.

Nonetheless, I contend that the most valuable file on any Internet user's hard disk, with regard to the time invested in creating it, is her bookmark file. Most likely this file has been around since the browser was installed and comprises the results of up to hundreds of hours of Web browsing, far longer than any user typically spends on any single document file. This makes the bookmark file, no matter what its state or utility, a valuable data source for understanding Information Seeking behavior on the Web.

Augmenting Bookmarks

Again, since studying the Web is relatively new, there are few studies seeking to improve bookmarking capabilities. Maarek and Ben Shaul's 1996 study, "Automatically Organizing Bookmarks per Contents" (Maarek 1996), argues that "the explosive growth in the Web leads to the need for personalized client-based local URL repositories often called bookmarks. We present a novel approach to bookmark organization that provides automatic classification together with user adaptation."

First, they re-confirm that URL classification, if undertaken at all, can be difficult and tedious, and again that many bookmarks have bad titles. They point out that:

"Typical bookmark repositories contain bookmarks which are collected by a single person with a limited and well-defined set of interests (compared to multi-user repositories). This implies that a repository is likely to have an inherent partitioning into groups with high intra-group similarity and low inter-group similarity. Such organization just needs to be discovered. Hence, providing automatic assistance in organizing bookmark repositories by their conceptual categories would be extremely beneficial."

Their goal is to develop a new kind of bookmark organizer application to help "discover" this organization. Their system is based on a repository of bookmarks with checksums that identify a bookmark's age or relationship to other bookmarks. Then, when checking or organization is needed, each bookmark's HTML document is collected and indexed; computed for pairwise similarity with other documents; and organized by clusters.
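
The index, compare, and cluster pipeline described above can be sketched as follows. This is a minimal illustration, not Maarek and Ben Shaul's algorithm: the bag-of-words index, the cosine similarity measure, the one-pass greedy grouping, and the 0.3 threshold are all assumptions chosen for brevity, and the URLs and page texts are made up.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Index a document as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """One-pass greedy grouping: join the first cluster holding a similar member."""
    vectors = {url: term_vector(text) for url, text in docs.items()}
    clusters = []
    for url in docs:
        for group in clusters:
            if any(cosine(vectors[url], vectors[other]) >= threshold
                   for other in group):
                group.append(url)
                break
        else:
            clusters.append([url])      # no similar cluster found, start a new one
    return clusters


# Hypothetical bookmarks mapped to the text of the pages they point to.
bookmarks = {
    "http://example.org/py-tutorial": "python programming language tutorial",
    "http://example.org/py-reference": "python language reference",
    "http://example.org/cake": "chocolate cake recipe",
    "http://example.org/frosting": "cake recipe with chocolate frosting",
}
print(cluster(bookmarks))
```

On this sample the two programming pages and the two baking pages end up in separate groups, which mirrors the "high intra-group similarity and low inter-group similarity" the authors expect of a single person's bookmarks.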

These techniques are the foundation of capturing, organizing, and coordinating bookmarks that almost all research, commercial, and ad hoc bookmark organizing tools use.

3.3.2 History Logs

Alison Lee gives us a general definition of history tools in her thesis studying the design, support provided by, and effort needed to utilize history tools:

"History tools allow a user to access past interactions kept in a history and to incorporate them into the context of their current operations" (Lee 1992).

History tools should enable users to expend less physical and mental effort in recalling or repeating their past actions.[6]

For the World Wide Web, history logs provide the following types of information about a Web browsing session:

  • title of the Web page visited
  • URL (Uniform Resource Locator or address) of the Web page visited
  • time (if less than 24 hours) or date of the first visit to a Web page
  • last visited time (if less than 24 hours) or date of the Web page
  • number of visits to the Web page
  • expiration of when the Web page should no longer be cached on the client system (this is rarely used)

Browser software uses the history file to coordinate the "Back", "Forward" and "Go" functions of the browser. The file is also used as a pointer to the possible cached files on a user's local system. Every page visited is stored in the cache, even if it is not bookmarked. Obviously, this data is a valuable asset in attempting to understand (and then augment) Web Information Seeking behavior.

Web Browser History Files

Unfortunately, few relevant studies that make use of a Web browser's history file are available. Most notable is Tauscher and Greenberg's 1996 evaluation of World Wide Web history mechanisms (Tauscher 1996a). From a user's perspective, they conclude that "improved history support within browsers can reduce the cognitive and physical burdens of navigating and recalling to previously visited Web pages". They find that 30% of all browser navigation is via the "Back" button. However, they also admit that currently, we cannot really know how often people revisit pages. As with the 1995 Catledge and Pitkow study, they modified a version of XMosaic to record user behavior. At that point, Web browser versions did not maintain a history of recent pages visited between uses of the Web browser, essentially clearing the "Go" menu of any previous pages each time a new browser session was begun. (However, they discovered that despite having this additional function to make navigation easier, only 3% used it.) Using the modified XMosaic browser at least provided more detailed analysis than was otherwise available.

In relation to navigation analysis of particular Web sites, Tauscher and Greenberg also question how much the Web site design influences the navigation -- frames or many-leveled sites are obviously harder for a user to envision than a simple flat, small set of Web pages. Since there are no standards for Web site design, any collected data is heavily biased by the types of sites users visit. Some guidelines for improved history file formats are also noted including showing a Web page only once; pruning startup and search pages (if the search terms are not saved too) to eliminate easily-reached pages; and making some kind of notation that hints at the structure of the overall Web site the page belongs to.

Not only are these guidelines useful for designing a system to augment Information Seeking, but they obviously also point to what tasks are important to capture during a browsing session.

Based on this study, Tauscher and Greenberg provide some empirical data on what makes an effective history mechanism (Tauscher 1996b). They state that since 58% of Web pages are revisited, "Web navigation is classified as a recurrent system. Hence, a history mechanism has value...". They further discovered that even a list of the previous 10 URLs visited can be useful in aiding navigation and more entries do not improve coverage enough to make up for the overload of more entries.
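
Both measurements mentioned above can be approximated from an ordered list of visited URLs. The sketch below is one plausible reading of the terms, not Tauscher and Greenberg's published procedure: it treats a visit as a revisit when its URL appeared earlier in the same log, and it measures how many revisits a list of the most recent distinct URLs would have covered. The function names and sample log are hypothetical.

```python
def recurrence_rate(visits):
    """Fraction of visits that return to a URL already seen earlier in the log."""
    seen, revisits = set(), 0
    for url in visits:
        if url in seen:
            revisits += 1
        seen.add(url)
    return revisits / len(visits) if visits else 0.0


def recent_list_coverage(visits, size=10):
    """Fraction of revisits whose URL is among the last `size` distinct URLs."""
    recent, seen = [], set()
    revisits = covered = 0
    for url in visits:
        if url in seen:
            revisits += 1
            if url in recent:
                covered += 1            # the history list would have offered this page
        seen.add(url)
        if url in recent:
            recent.remove(url)          # keep each URL only once in the list
        recent.append(url)
        del recent[:-size]              # drop entries beyond the list size
    return covered / revisits if revisits else 0.0


log = ["a", "b", "a", "c", "b", "a"]
print(recurrence_rate(log))             # half of the six visits are revisits
print(recent_list_coverage(log, size=2))
```

Comparing coverage at several list sizes on real history data is one way to check the claim that a short list captures most revisits while longer lists add little.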

In the spirit of increasing utility, they contend that a history mechanism should make it easier to revisit pages than to navigate back using other methods. For example, they recommend that Web pages already recalled via the history mechanism should be even more easily selectable, groups of related (by URL mostly) pages should be clustered in hierarchical menus, users should be able to customize their history view, and it should be possible to easily search the entire history file.

In 1994 Cooper also analyzed a group of 98 Mosaic and Netscape history files for common readership (Cooper 1994). In this case, there was no customized browser; the log files were copied from subjects' accounts (with their consent) by the system administrator. The common readership among pages was surprisingly low: a vast majority of pages (85.54%) were seen by only one person. Despite this finding, this study again illustrates the utility of using history files to study Information Seeking on the Web. In networked environments where files are kept on servers, this method of data collection can be convenient.

I am convinced that history logs are both the least-utilized and most-promising collection of the available Web browser-support files that can tell us more about an individual's Information Seeking behavior.

3.3.3 Challenges in Collecting Web browser data

The openness of Internet standards has a down side when attempting to collect data on Web browsing. Pitkow and Recker (Pitkow and Recker 1994b) succinctly point out that:

"The range of hypertext systems continues to expand, from custom-tailored, closed systems to dynamic, distributed, and open systems like the World Wide Web (WWW). The shift from closed to open systems results in a corresponding decrease in the effectiveness of metrics and techniques for providing intelligent hypertext to users. Essentially, the locus of control shifts away from developers towards users. Stated differently, the central question becomes how chaotic, loosely constrained environments, like the World Wide Web, can provide intelligent (information about) hypertext(usage)."

Despite this difficulty, they argue that there are some methods for obtaining metrics. They propose both top-down and bottom-up implementations of their ideas. For the bottom-up view, they examined individual user link usage patterns. Once again, an XMosaic browser was modified to log the time and the hyperlink that each user traversed. By looking at this data from a top-down view and looking for trends between users, they found a strong power relationship between frequency and probability of access. This is much like the earlier idea of individual behavior models scaling up to describe group behavior: their findings matched closely with current research on memory use. This coincides rather well with what history tools should provide for the user -- augmented memory.

3.4 Server Log Data

Another source of data about Information Seeking on the Web is server logs. These logs are controlled by applications that deliver files to Web browsers.

On most servers, logging options are minimized to reduce system overhead, but to collect complete data, I argue the performance hits are a small price to pay for the wealth of accurate information collected. Even periodic log collection can prove useful when analyzing Web usage.

Another (albeit expensive) option is to use another system networked to the server to gather log information. Alternatively, some clever Web server administrators (often called Webmasters) use an array of servers to serve specific types of content only allowing the "main" Web server to gather native log information.[7]

Other internal server information can be collected such as email logs to and from the system administrator (the data size and times -- not necessarily the content of the mail), program usage logs (even at the maintenance level), and disk usage (direct system information about various cache configurations).

Log information can be further augmented externally from the server by using programs that compare server usage to others or to gather more information via nslookup, ping, who, finger, and other network identification programs.[8]

3.4.1 Web Servers

Web servers are the dominant type of internet server that collects log information on the files served to Web browsers. In general, Web servers can send any media format they are configured for as MIME (Multipurpose Internet Mail Extensions) types. Standard formats include text, graphics, CGI (Common Gateway Interface) programs, and Java applets. The latter two are scripts or programs that execute on either (or both) the client or server to perform some extra function. Obvious implications of adding programming logic to a server system include affecting performance, content organization, and enhancing server capabilities. However, most of these programming techniques circumvent server logging because typical system logs only capture basic server procedures (primarily HTTP GET, POST, and SEND). With the added complexity of additional operations, these standard procedures can be skipped or incorrectly logged compared to the actual function of the server (Lui 1994).

Web Server Log Accuracy

As Web servers are the primary method to collect Web user data about a particular Web site or (less so) about Web usage in general, it is crucial to understand the problems of gathering accurate server information. One of the primary problems with gathering Information Seeking behavior information on Web servers is that no state information is captured between a client's requests to the server. The current server protocols treat every access as separate; by default, no user-specific array of requests for the Web server's resources is recorded. The Web server does not know who is requesting a file from one request to the next, so there is no ongoing relationship of service between the client and server.

Another problem is that server requests themselves do not represent true usage. Stout (Stout 1997) gives us definitions of the hierarchy of Web server interactions.

Hit - a hit is when a single file is requested from the Web server.[9] (This can be the HTML text of the page itself or a single graphic file like a page logo or a navigational arrow graphic that the HTML file references.)

View - a view is when a Web browser user gets all of the information contained on a single Web page. (The page is "viewed" by the user.)

Visit - a visit is one series of views at a particular Web site. (Going to a Web site and reading one page, following a link to another page, and so on until going to another Web site or quitting the Web browser.)
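
The three levels defined above can be counted from a server access log, as the sketch below illustrates. It makes several assumptions not stated in the source: that the log is in Common Log Format, that a request for a path ending in ".html" or "/" stands for a page view, that a requesting host stands for one user, and that a visit ends after 30 minutes without a view (the server cannot observe a user leaving the site). The function name and sample lines are hypothetical.

```python
import re
from datetime import datetime, timedelta

# host ident authuser [timestamp] "METHOD path protocol" status bytes
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) \S+')


def summarize(lines, timeout=timedelta(minutes=30)):
    """Count hits, views, and visits in Common Log Format lines."""
    hits = views = visits = 0
    last_view = {}                      # host -> time of that host's latest page view
    for line in lines:
        m = LOG_LINE.match(line)
        if not m:
            continue                    # skip lines that do not parse
        host, stamp, _method, path, _status = m.groups()
        hits += 1                       # every requested file is a hit
        if not path.endswith((".html", "/")):
            continue                    # embedded graphics and the like are not views
        views += 1
        when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        if host not in last_view or when - last_view[host] > timeout:
            visits += 1                 # first view, or first view after a long gap
        last_view[host] = when
    return hits, views, visits


sample = [
    'h1 - - [01/Jan/1997:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1000',
    'h1 - - [01/Jan/1997:10:00:01 +0000] "GET /logo.gif HTTP/1.0" 200 500',
    'h1 - - [01/Jan/1997:10:05:00 +0000] "GET /news.html HTTP/1.0" 200 800',
    'h2 - - [01/Jan/1997:10:06:00 +0000] "GET / HTTP/1.0" 200 1000',
    'h1 - - [01/Jan/1997:11:00:00 +0000] "GET /index.html HTTP/1.0" 200 1000',
]
print(summarize(sample))
```

Even on five sample lines the three counts differ, which is the point of Stout's hierarchy: a raw hit count overstates both page views and visits.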

These different interactions show that commonly-used Web counters are most likely inaccurate too. Counting the files (hits) served alone does not really tell how many people have visited a Web site. Redundant or tangential files served (such as small graphics used as list bullets) are all considered as important as any other data in the logs.[10] Web server accuracy is also possibly affected by ignored requests for files from the Web server due to congestion when receiving requests. Requests can also be lost or expire while being routed along the Internet to the site. Some users may also be turned away when the maximum number of connections the Web server is set to handle is reached.

One final threat to valid metric information is the advent of software robots or Web spiders (used by Web search engines or indexers) that can access a site and any number of its resources -- even multiple times. What this host of problems shows is that much of the data currently gathered about Web usage is highly dubious. However, with combinations of server logs and appropriate design of system resources, more relevant data can be gathered.[11]

Proxy Servers

A proxy server is a server that stores and locally serves a cache of files received from Web servers on the Internet for a group of Web users. This description is from Netscape Communications Corporation's Proxy Server documentation:

"Proxy Server is high-performance server software for replicating and filtering content on the Internet and intranets. Proxy Server uses an efficient replication model to cache web content locally, to conserve network bandwidth and give users faster access to documents. Proxy Server also enhances network security, because network packets do not pass directly between users' systems and the Internet. However, users' access to Internet services remains transparent."(Netscape 1997)

This means that the proxy server saves local copies of the files a Web user requests and records aggregate statistics about the files it gets for each user (and any others using the proxy server). Many corporations use proxy servers to keep local copies of Web pages many of their employees visit regularly. This means that as a Web browser user requests a Web page, the proxy server gets the request and, if the page is stored on the local proxy, the page is sent immediately to the user instead of being retrieved from the Internet. (Of course, a proxy server keeps a Web page in its cache only so long, to ensure that up-to-date Web pages are served to the Web user.)
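
The caching behavior just described, together with the per-user request log that makes proxies useful for analysis, can be sketched as a small class. This is a simplified illustration, not Netscape Proxy Server's design: the class name, the fixed `max_age` expiry, and the injected `fetch` and `clock` functions are all assumptions, and real proxies normally rely on expiry information supplied by the origin server.

```python
import time


class ProxyCache:
    """Serve pages from a local cache until they are older than max_age seconds."""

    def __init__(self, fetch, max_age=3600, clock=time.time):
        self.fetch = fetch              # function that retrieves a URL from its server
        self.max_age = max_age
        self.clock = clock
        self.store = {}                 # url -> (time fetched, page body)
        self.log = []                   # (time, user, url, "HIT" or "MISS")

    def get(self, user, url):
        now = self.clock()
        cached = self.store.get(url)
        if cached and now - cached[0] <= self.max_age:
            outcome, body = "HIT", cached[1]        # fresh local copy
        else:
            outcome, body = "MISS", self.fetch(url)  # absent or stale: go to the origin
            self.store[url] = (now, body)
        self.log.append((now, user, url, outcome))
        return body


# Usage with a stand-in fetch function instead of a real network request.
proxy = ProxyCache(fetch=lambda url: "contents of " + url, max_age=60)
proxy.get("alice", "http://example.org/")
proxy.get("bob", "http://example.org/")
print([entry[1:] for entry in proxy.log])
```

Note that only the first request reaches the origin server. The origin's own log therefore records one hit where the proxy log records two users, which is why proxy logs are the richer source for group-level analysis.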

Proxy servers are rich sources of data because each proxy server keeps a log of the files each user requests. Access times and frequency of requests for Web resources for all of the proxy's users are all stored in one file. This makes an excellent source for wide-scale analysis of a group or user's Web usage patterns.

Network Servers

Network servers are the programs that control system-level networking security and access on the server (Spainhour 1996). Network servers control the data sent to and received from the Internet. The primary way to gather Web usage behavior with network servers is to "sniff" the TCP/IP (Transmission Control Protocol/Internet Protocol) packets of data as they move through the network. Obviously, this requires a lot of translation to result in readable data logs. The packets that contain Web-related requests from browsers (identifiable as HTTP messages) are extracted and saved to a log file.

Marc Abrams and Stephen Williams propose that network logs alone do not provide much additional data unless used in conjunction with other logs or surveys. They propose a client monitoring setup as mentioned above, but also a setup for monitoring the network packets that are sent and received by a Web server or group of Web servers (Abrams 1996). This can supplement proxy logs with information not normally picked up by the proxy server, such as the exact URLs requested. Network server logs could also be used to gather information from Web users who access the Internet through a pooled networking setup like the ones Internet Service Providers use.

3.4.2 Server Log Studies

Because conducting server log studies usually requires programming or system setup skills native to only a small set of researchers, there is not currently a significant volume of studies.

One type of study is the intra-site examination of how users interact with a particular Web site. There are several current commercial applications that gather this information for either marketing data on a Web user or to customize the Web pages a particular user sees. Customizing a Web site visit for a user is useful in that it can help us discover Web user behavior about interaction methods, content display, and privacy concerns.

Yan et al. (Yan et al. 1995) present an approach for automatically classifying visitors of a web site according to their access patterns.

"User access logs are examined to discover clusters of users that exhibit similar information needs; e.g., users that access similar pages. This may result in a better understanding of how users visit the site, and lead to an improved organization of the hypertext documents for navigational convenience. More interestingly, based on what categories an individual user falls into, we can dynamically suggest links for him to navigate. In this paper, we describe the overall design of a system that implements these ideas, and elaborate on the preprocessing, clustering, and dynamic link suggestion tasks. We present some experimental results generated by analyzing the access log of a web site."

Through analyzing user access logs with clustering techniques they can discover categories that might not be thought of by the Web site designer and may have been impossible to have predicted given the relatively constricted navigation present on most Web sites. In Zipfian fashion, they found clusters that accessed pages physically linked together. If Web links were more broadly established, would this clustering still remain constant?

As the content of many Web sites is in flux, analysis that provides automatic clustering and dynamic linking can improve a static Web site design.

"During a session, a user may show varying degrees of interests in these items. If there are n interest items in the web site, we may represent a user session as an n-dimensional vector, the i-th element being the weight, or degree of interest, assigned to the i-th interest item. If we view an HTML page as an interest item, then we can give it a weight equal to the number of times the page is accessed, or an estimate of the amount of time the user spends on the page (perhaps normalized by the length of the page), or the number of links the user clicks on that page."(Yan et al. 1995).

If techniques like these are performed on proxy or network logs, it might be possible to determine an individual's Web browsing behavior across more than one Web site.
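As a minimal sketch of the session vector that Yan et al. describe above, the following assumes the simplest of their suggested weights, the number of times each page is accessed. The page names and the function name `session_vector` are hypothetical and are not taken from their system.

```python
from collections import Counter


def session_vector(session_pages, interest_items):
    """Return one weight per interest item: the number of times it was accessed."""
    counts = Counter(session_pages)
    # Counter returns 0 for pages that never appear in the session.
    return [counts[item] for item in interest_items]


# Hypothetical site with four interest items (pages) and one user's session.
items = ["/index.html", "/papers.html", "/people.html", "/contact.html"]
session = ["/index.html", "/papers.html", "/index.html",
           "/papers.html", "/papers.html"]

vector = session_vector(session, items)
print(vector)
```

Vectors built this way for many sessions could then be handed to any clustering routine. Time spent on a page or the number of links clicked could be substituted as the weight without changing the structure.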

However, accurately tracking users across Web sites is still problematic. One approach to combating this is to invoke "cache-busting", that is, to disable all the Web caching available along a Web user's chain of access points to a Web server. By using a series of "do not cache this request" messages, each Web user's request would be faithfully recorded by a logging mechanism.

A current set of proposals, called "hit-metering", addresses some of these difficulties. Mogul and Leach propose an HTTP header extension called "Meter" that contains some user information that can be passed through some of the various caching schemes used on the Internet (Mogul and Leach 1997).

Pitkow (Pitkow 1997) reviews some sampling approaches to measure Web usage:

day sampling - disabling caching on certain days only (to minimize network usage) and analyzing the more reliable data from those days.

IP sampling - canceling caching by the requesting Internet Protocol (IP) address of a Web user and measuring results from these users only.

continuous sampling - using Web client cookie files (a small identifying file a Web server can write to a Web browser that identifies a user only to the cookie-assigning Web server) to disable caching for each single user participating in the survey.

It is possible to discover discrete user visits to a Web site, or even a profile of an entire Web browsing session. Pitkow suggests a reasonable idea: that "appropriate time-out periods are typically determined by inspection of the distribution of time between all page requests to a site" (Pitkow 1997).

It may also be possible to discern the actual time a user spends on a particular Web page. If "a statistically valid sample size is determined and the data collected, reasonable statements about the reading times of pages can be made. Catledge and Pitkow 1995 first reported a session time-out period of 25.5 minutes, which was 1.5 standard deviations from the mean of 9.3 minutes between user interface events. A time-out period of 30 minutes has become the standard used by log file analysis programs."
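The time-out idea can be sketched as follows. This is only an illustration, assuming that a user's requests are available as timestamps and that a gap longer than the 30-minute standard starts a new session. The function name `split_sessions` and the sample times are hypothetical. (If the 25.5-minute figure is read as the mean plus 1.5 standard deviations, the implied standard deviation is about 10.8 minutes, since (25.5 - 9.3) / 1.5 = 10.8.)

```python
from datetime import datetime, timedelta


def split_sessions(timestamps, timeout=timedelta(minutes=30)):
    """Group one user's request times into sessions separated by long gaps."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= timeout:
            # Gap is within the time-out: same session.
            sessions[-1].append(ts)
        else:
            # First request, or the gap exceeded the time-out: new session.
            sessions.append([ts])
    return sessions


# Hypothetical request times for a single user on one day.
day = datetime(1997, 5, 1)
requests = [day.replace(hour=9, minute=0),
            day.replace(hour=9, minute=5),
            day.replace(hour=9, minute=20),
            day.replace(hour=10, minute=15),   # 55 minutes after the last request
            day.replace(hour=10, minute=20)]

sessions = split_sessions(requests)
print([len(s) for s in sessions])
```

A site-specific time-out, chosen by inspecting the distribution of gaps as Pitkow suggests, can be passed in place of the default.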

As more server logs and resulting statistics are collected, further progress in innovative measuring techniques may enable us to determine even more about a user's Web browsing behavior.

Alternate Views of Server Logs

However, statistical studies are not the only way to learn more about server log information. Some studies are developing better visualization for current Web log results. Jim Pitkow and Krishna A. Bharat present a tool, WebViz, that provides graphical views of Web server logs (Pitkow and Bharat 1994). WebViz has two main features:

  1. using a path-through-the-Web-site view (a Web-Path paradigm) that shows both the documents and links traveled through on the site.
  2. filtering the log by a user's Internet address, certain documents on the Web server, or beginning and ending times per Web site visit.

Cleverly, they use colors that match the "non-linear cooling curve of a hot body; from white hot to yellow hot through red hot to blue" and size to show the recency and amount of node (Web page) access. Temporal information is regulated with sliders or playback controls, much in the spirit of Shneiderman's HomeFinder interface (Williamson and Shneiderman 1992).

The goal of all of the methods described in this section is to gather information about a Web user's Information Seeking behavior. This data can then be used to augment the Information Seeking process. The next section describes tools and concepts that perform this augmentation function, known as Collaborative Filtering systems.

4. Collaborative Filtering

This section provides a general introduction to Collaborative Filtering and presents recent significant studies and systems for Information Filtering using both the Internet and the World Wide Web. Also included are studies that illustrate general Collaborative Filtering techniques and a review of current Collaborative Filtering systems for both the Internet and the World Wide Web.

Collaborative Filtering is a way to use others' knowledge as recommendations that serve as inputs to a system that aggregates and directs results to the appropriate recipients. In some Collaborative Filtering systems, the primary action is in the aggregation of data; in others, the system's value is the ability to make good matches between the recommenders and those seeking recommendations (Resnick and Varian 1997).

The term "Collaborative Filtering" was first used by the developers of the Tapestry system. While "Recommender System" is an often-used synonym now, I prefer to use the term "Collaborative Filtering" to focus more on the process of filtering than on eliciting users' recommendations about a topic. I view Collaborative Filtering as a quantitative method to gather qualitative data.

Collaborative Filtering systems, by their existence, admit that it is hard to design an application to filter information. With the relative failures of Artificial Intelligence in filtering techniques, we are left to fall back on other strategies to help us make sense of the growing world of information.

Collaborative Filtering systems range from simple thumbs-up/thumbs-down ratings to a matrix of codified attributes to descriptive textual information. Recommendations may be gathered explicitly or implicitly; they may be personal or anonymous. Almost any kind of item or information can be the subject of a Collaborative Filtering system. Because the focus of this paper is Information Systems, most notably the World Wide Web, the review is narrowed mainly to Internet or Web-based systems and to Internet or World Wide Web data.

This section covers a wide range of ideas related to Collaborative Filtering: Information Filtering, the root of Collaborative Filtering; studies that impact the design of Collaborative Filtering systems; and finally examples of Collaborative Filtering systems.

4.1 Information Filtering

The use of information systems to filter electronic information is nothing new. However, in the past couple of decades, the shift in systems design has focused on user interaction, bringing more work regarding filtering to light.

The problem of information overload on the Internet has brought new attention to the ideas of filtering information. There is broad agreement that help in organizing and presenting information is much needed to make the Internet easier to use.

Keys to filtering internet information are already obvious: looking for patterns or keywords in data and letting user feedback guide future searches. Belkin and Croft (Belkin and Croft 1992) point out that at an abstract level, Information Filtering and Information Retrieval are very similar. They describe Information Filtering as delivering information to people who need it. This information is most likely unstructured data and not from a controlled database. The information itself is primarily textual, is in large amounts, and is constantly incoming. They also point out that filtering is traditionally based on user profiles, typically representing both the user's long-term interests and indifferences. As this implies, filtering means removing data from the user's view, not finding data.[12]

Some basic filtering techniques include using descriptions of information (including its source), user profiles, algorithms or comparison tables, considering the source of the material, and comparing histories of queries or use. Common filtering methods include Boolean searches, keyword descriptors, vector matching, probabilistic (statistical) models, and log file analysis.
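To make one of these methods concrete, here is a minimal sketch of vector matching against a user profile. It assumes raw word counts as the vector weights, cosine similarity as the matching measure, and an arbitrary threshold of 0.2. None of these choices come from a particular system reviewed here, and the function names are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_documents(profile_text, documents, threshold=0.2):
    """Keep only the documents similar enough to the user's profile."""
    profile = term_vector(profile_text)
    return [d for d in documents
            if cosine(profile, term_vector(d)) >= threshold]


# Hypothetical profile and incoming documents.
incoming = ["collaborative filtering of netnews",
            "recipes for apple pie"]
kept = filter_documents("collaborative filtering web", incoming)
print(kept)
```

A Boolean or keyword filter would replace `cosine` with a simple membership test. The profile itself could be updated from user feedback over time.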

A variety of systems using these techniques and methods are described below. Some general, earlier systems for filtering information on the internet are reviewed, followed by the more relevant Web-based filtering systems.

4.1.2 Information Filtering on the World Wide Web

As the focus of this paper is Internet information, filtering systems that help to overcome the overload of information received from the Internet are described. The main difference between these systems and Collaborative Filtering systems is that these systems rely only on the individual's feedback. Taken from another perspective, however, I am becoming convinced that these systems essentially build a collaborative relationship among the past behaviors of the individual user. In effect, the user is collaborating with himself to gradually, adaptively improve his filtering preferences. What this implies is that no filtering system idea can be classified truly as either personal or collaborative. The same types of techniques can be used on an individual or aggregate basis.[13]

Information Filtering Systems

In contrast to the many Information Retrieval systems that use filtering, the Information Filtering systems here place a great deal of importance on the interface of the system. With the masses of rapidly changing material on the Internet, these systems also focus on rapid analysis and presentation of information.

Rhodes and Starner (Rhodes and Starner 1996) designed the Remembrance Agent, a filtering system that runs continuously.

"The Remembrance Agent (RA) is a program which augments human memory by displaying a list of documents which might be relevant to the user's current context. Unlike most information retrieval systems, the RA runs continuously without user intervention. Its unobtrusive interface allows a user to pursue or ignore the RA's suggestions as desired."

The idea of running continuously is particularly applicable for an agent or assistant program. With more powerful desktop computers, most of a computer's CPU time is spent waiting for the user to hit the next keystroke, read the next page, or load the next packet off the network. There is no reason the computer can't use those otherwise wasted CPU cycles constructively by performing continuous searches for information that might be of use in its user's current situation.

The Remembrance Agent has two main processes: one that continuously watches what the user types and reads, and another that finds information which is somehow related to the user's actions. This dynamically filtered information is then displayed for the user.

Information Filtering-Oriented Studies

Malone (Malone et al. 1987) provides a good overview of the types of filtering possible as he developed The Information Lens, a prototype intelligent information-sharing system. He points out these approaches to filtering large volumes of electronic information:

  1. cognitive filtering - attempts to characterize the contents of a message and the information needs of the recipient.
  2. social filtering - works from the relationship between the sender and the recipient.
  3. economic filtering - implicit cost-benefit analyses to determine what to do with a piece of electronic mail.

This paper serves as one of the foundation papers for future research into filtering. Note that the term "Collaborative Filtering" is not even used at this point. Instead of a more Information Retrieval approach to filtering, focus was centered on designing quality user interfaces that could support both an individual and a group. With the advent of the Web and its graphical user interface, some of the initial design decisions of a filtering system are turning again toward interface design.

Morita and Shinoda (Morita and Shinoda 1994) study another approach to filtering information: monitoring the behavior of a user while viewing information. The information in this case is netnews, controlled and populated by internet newsgroup servers. In a longitudinal study of eight users reading eight thousand netnews messages, they observed a strong positive correlation between the time spent reading a message and the user's interest in the message. Of course, this seems obvious, but further studies might reveal specific interest levels as functions of the time spent reading a message. Techniques like this can provide firm ground to establish links between a user's Information Seeking behavior and the value of each element of information.

4.1.3 Information Filtering and Retrieval on the World Wide Web

Filtering World Wide Web information which uses HTML, while being difficult, does have advantages over plain text or descriptive text. As explained in Bibliometrics and the Web, earlier in this paper, each Web document has a series of characteristics that help us identify it:

  • the URL of the document tells us the type of server it is on and who is providing it, and may contain additional clues to its content in the directories and file name it is stored under.
  • HTML tags, at minimum, tell us the title and major headings of the document.
  • the date and time of the document's creation or update can indicate how recent the information it contains is.
  • the document or parent server might be classified by one of the Web's catalog services such as Yahoo! or Excite.
  • the document may have been or can be indexed to provide some insight as to its content.

This more technical data, in addition to the concept of the "document fingerprint", can at least begin to describe the document. Then as a user's preferences are known, these document characteristics can be used as criteria to include or exclude a document.
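As a rough sketch of how the first two characteristics might be gathered, the following uses only the Python standard library to pull the host and path clues from a URL and the title and major headings from the HTML. The class and function names, the sample URL, and the choice to stop at third-level headings are my own assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class HeadingParser(HTMLParser):
    """Collect the text of <title> and <h1> to <h3> elements."""

    WANTED = {"title", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.current = None   # tag we are currently inside, if wanted
        self.found = []       # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        if self.current and data.strip():
            self.found.append((self.current, data.strip()))


def describe(url, html):
    """Return simple document characteristics usable as filtering criteria."""
    parts = urlparse(url)
    parser = HeadingParser()
    parser.feed(html)
    parser.close()
    return {
        "host": parts.netloc,
        "path_clues": [p for p in parts.path.split("/") if p],
        "title": next((t for tag, t in parser.found if tag == "title"), None),
        "headings": [t for tag, t in parser.found if tag != "title"],
    }


# Hypothetical document.
page = ("<html><head><title>Filtering Papers</title></head>"
        "<body><h1>Overview</h1><p>text</p><h2>Systems</h2></body></html>")
info = describe("http://www.example.edu/papers/filtering/index.html", page)
print(info)
```

Once a user's preferences are known, a filter could include or exclude a document by testing these fields, for example by host or by words in the title.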

The Softbot Lab at the University of Washington has a number of systems that can be used to filter information. Most notably, the Ahoy! page finder, which features the concept of "Dynamic Reference Sifting" (Shakes 1996), provides a sampling of the ideas used to filter Web information.

Ahoy! uses Dynamic Reference Sifting to find personal homepages. With a person's name and organization, Ahoy! uses these key elements:

"1. Reference source: A comprehensive source of candidate references (e.g., a web index such as AltaVista).

2. Cross filter: A component that filters candidate references based on information from a second, orthogonal information source (e.g., a database of e-mail addresses).

3. Heuristic-based Filter: A component that increases precision by analyzing the candidates' textual content using domain-specific heuristics.

4. Buckets: A component that categorizes candidate references into ranked and labeled buckets of matches and near misses.

5. URL Generator: A component that synthesizes candidate URLs when steps 1 through 4 fail to yield viable candidates.

6. URL Pattern Extractor: A mechanism for learning about the general patterns found in URLs based on previous, successful searches. The patterns are used by the URL Generator."

Applying such a combination of techniques seems to be the trend in filtering information on the Web. This type of resource triangulation is one way to cope with the rapidly changing standards of Web document serving.

The current growth of automated Web site building applications has made using search engines alone less useful. Many Web pages are served dynamically, as requested by a Web browser. They have no static representation on the Web which search engines can index. Even the manually updated directories of Web information are falling behind the glut of new information appearing. This multiple-technique approach seems likely to be the best method to locate and organize Web resources in the future.

WEBFIND

Another filtering tool is WEBFIND (Monge and Elkan 1996), used for finding scientific papers on the Web. WEBFIND uses a new approach to sorting the many references to a paper on the Web. The approach is to use a combination of external information sources as a guide for finding where to look for information, notably the Web tools MELVYL (a bibliographic database at the University of California) and NETFIND (a "white pages" service that provides internet host addresses as well as applicable email addresses). WEBFIND combines the information gathered from each resource to find the paper itself or the best path for locating the paper. WEBFIND filters and composes queries in real time, providing an element of interactivity to the process. If a path does not appear to the user to be leading toward the information, that path can be filtered out of what is displayed next.

Tools like WEBFIND that capitalize on another information resource on the internet will undoubtedly be one of the techniques used by other filtering applications in the future. As internet searching and filtering tools get more robust, each tool can be used according to its strengths to help sort and locate information. Hopefully, new tools will support formats that enable easy exchange of information, accelerating this trend.

Letizia

Henry Lieberman (Lieberman 1995) developed a personal filter in an agent called Letizia. Letizia is a user interface agent that assists a user browsing the World Wide Web by tracking usage behavior and attempting to anticipate items of interest by doing concurrent, autonomous exploration on links from the user's current Web page. Letizia automates a browsing strategy consisting of a best-first search augmented by heuristics inferring interest from browsing behavior.

Lieberman uses a simple concept with Letizia: the past behavior of the user should provide a rough approximation of the user's interests. Obviously, information that is ignored by the user is filtered, or not recommended, by Letizia. Interest is noted when the user repeatedly returns to a document or spends a lot of time browsing the document (relative to its length). Letizia uses not only this fact but also a very clever idea first seen in this system: "since there is a tendency to browse links in a top-to-bottom, left-to-right manner, a link that has been 'passed over' can be assumed to be less interesting. A link is passed over if it remains unchosen while the user chooses other links that appear later in the document. Later choice of that link can reverse the indication."
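The passed-over heuristic can be sketched as below. This is my own illustrative reading of the quoted rule, not Letizia's code: links are assumed to be listed in document order, and a link counts as passed over if it is unchosen and appears before the latest-positioned chosen link. The function name is hypothetical.

```python
def passed_over(links_in_order, chosen):
    """Return unchosen links that appear before some chosen link in the document."""
    chosen_set = set(chosen)
    positions = [i for i, link in enumerate(links_in_order)
                 if link in chosen_set]
    if not positions:
        return []   # nothing chosen yet, so nothing has been passed over
    last_chosen = max(positions)
    return [link for i, link in enumerate(links_in_order)
            if i < last_chosen and link not in chosen_set]


# Hypothetical page with four links in top-to-bottom order.
links = ["a", "b", "c", "d"]
first = passed_over(links, ["c"])          # user jumps straight to link c
second = passed_over(links, ["c", "a"])    # user later comes back to link a
print(first, second)
```

The second call shows the reversal mentioned in the quotation: once link a is chosen, it no longer counts as passed over.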

These ideas are powerful first steps using the agent concept and watching user behavior when browsing Web pages. However, some improvement might be necessary to determine when a user is no longer attending to his Web browser and to compensate for a bias against more complex pages (pages that take a longer time to read or digest than average).

4.2 Collaborative Filtering Studies

Few independent studies have reviewed the whole process of Collaborative Filtering. Some focus only on statistical methods and others on interfaces.

Hill et al. (Hill et al. 1995) provide a thorough review of Collaborative Filtering issues and their influences on each other. They point out that when

"making a choice in the absence of decisive first-hand knowledge, choosing as other like-minded, similarly-situated people have successfully chosen in the past is a good strategy--in effect, using other people as filters and guides: filters to strain out potentially bad choices and guides to point out potentially good choices. Current human-computer interfaces largely ignore the power of the social strategy. For most choices within an interface, new users are left to fend for themselves and if necessary, to pursue help outside of the interface."

They use a method that forms the basis of Collaborative Filtering: working from a set of user preferences, they automate a social method for informing choice and observe how it fares in a test involving the selection of videos from a large set. They used algorithms for two functions, one for recommending and another for evaluating the videos. After trying a few versions of each, they found that the algorithms (a Pearson r correlation) were good, but note that they were most likely not the best possible, hinting at a future study. They state that reliability measures in fact put a bound on prediction, because of the difficulty involved in predicting what a user selected once based on what they selected for the same question last time.

"One of the standard uses of reliability measures is to put a bound on prediction performance. The basic idea is since a person's rating is noisy (i.e., has a random component in addition to their more underlying true feeling about the movie) it will never be possible to predict their rating perfectly. Standard statistical theory says that the best one can do is the square root of the observed test-retest reliability correlation. (This is essentially because predicting what the user said once from what they answered previously has noise in at both ends, squaring its effect. The correlation with the truth, if some technique could magically extract it, would have the noise in only once, and hence is bounded only by the square root of the observed reliability). The point to note here is that the observed reliability of 0.83 means that in theory one might be able to get a technique that predicts preference with a correlation of 0.91. The performance of techniques presented here, though much better than that of existing techniques, is still much below this ideal limit. Substantial improvements may be possible. "

Hill and his team found positive results: their approach outperformed using movie critics to select videos. They also noticed a bias towards positive ratings.
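The arithmetic behind the bound quoted above is easy to check. The short sketch below simply takes the square root of the observed test-retest reliability reported in the quotation; the variable names are mine.

```python
import math

observed_reliability = 0.83   # test-retest correlation reported in the quotation

# The best achievable correlation with a user's "true" preference is bounded
# by the square root of the observed reliability.
upper_bound = math.sqrt(observed_reliability)
print(round(upper_bound, 2))
```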

In another study, Wulfekuhler et al. (Wulfekuhler 1997) examine techniques that detect common features in sets of pre-categorized documents in order to find similar documents on the World Wide Web. They found that extracting word clusters from the raw document features proved successful in discovering word groups that can be used to find similar Web documents.

They compared these results to supervised pattern recognition, which uses training samples from known categories to form a decision rule for unknown patterns, and found that not knowing the number and labels of the categories of Web documents made accuracy difficult. Clustering techniques, by contrast, are unsupervised: they require no external knowledge of categories and simply group similar patterns into clusters whose members are more similar to each other (according to some distance measure) than to members of other clusters.

They also point out that pattern recognition requires a greater number of documents. For their "4 class problem and 5190 features, this means we would need one to two hundred thousand training documents. Clearly this is not possible for this domain." They note that there may be additional techniques that can help to reduce the number of training samples, but seem content to work with word clustering. Here we see another method of analysis that helps to advance Collaborative Filtering technology and is both relatively simple to implement and effective in helping to locate Web documents.

4.3 Collaborative Filtering on the Internet

The first generation of Collaborative Filtering software on the Internet predates the World Wide Web. These filtering systems mainly focused on netnews (Usenet news, also known as newsgroups) and email.

4.3.1 Tapestry

The most important of these early systems is the Tapestry system developed at Xerox PARC to deal with the growing amount of electronic mail and newsgroup traffic. The Tapestry designers state their idea simply: "A basic tenet of the Tapestry work is that more effective filtering can be done by involving humans in the filtering process" (Goldberg et al. 1992).

Tapestry users record their reactions to documents they read by making annotations that can be accessed by others' filters. Tapestry not only stores these annotations, but also houses an individual's email and netnews. This enables the system both to act on filters and to deliver resources as they are requested. For example, a user might want to find out about a topic. Instead of relying on keyword searching alone (which can itself be used to assign filters to documents), she might look for documents that not only have the correct keywords, but that also have at least three endorsements from other Tapestry users. As these new documents are being read, the user can endorse them herself. For additional help, a user can also search another user's public annotations to help in finding a document. All of these filtering methods can be automated by making them a "query filter" for the user.
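The example in the previous paragraph, keywords plus at least three endorsements, can be sketched in Python. This is not Tapestry's own query language; the `Document` record, the `query_filter` function, and the sample data are hypothetical and serve only to show the combined condition.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    endorsed_by: set = field(default_factory=set)   # names of endorsing users


def query_filter(documents, keywords, min_endorsements=3):
    """Keep documents containing every keyword and enough endorsements."""
    wanted = [k.lower() for k in keywords]
    return [d for d in documents
            if all(k in d.text.lower() for k in wanted)
            and len(d.endorsed_by) >= min_endorsements]


# Hypothetical document store.
docs = [
    Document("Filtering survey", "A survey of information filtering methods",
             {"ann", "bob", "cy"}),
    Document("Filtering note", "Short note on filtering", {"ann"}),
    Document("Cooking", "Recipes for the weekend", {"ann", "bob", "cy", "dee"}),
]

titles = [d.title for d in query_filter(docs, ["filtering"])]
print(titles)

# Reading a document and endorsing it changes what later queries return.
docs[1].endorsed_by.update({"bob", "eve"})
titles_after = [d.title for d in query_filter(docs, ["filtering"])]
```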

Tapestry operates using a client/server model. A reader/browser lets the user use the filtering and annotating features, or a user can keep using his own mail reader and use the Tapestry server (as a remailer) as the coordinator for his preferences. A Tapestry Query Language (TQL) is the means of running the filters on information. Cleverly, the filters themselves are provided as queries on the system, enabling one query to contain more than one filter. TQL also supports advanced queries that can provide some conditional processing and filtering based on the attributes (date, time, sender, etc.) of a document.

The query and storage features make Tapestry a powerful filtering system in itself, but when combined with the collaborative capabilities, it truly provided something new in dealing with the influx of information at the time.

4.3.2 Lotus Notes

Another system that is used as a foundation for Collaborative Filtering techniques is Lotus Notes. Typically thought of as the standard for groupwork in corporations, Notes can also be used for Collaborative Filtering among its users. In some ways, from a user-profiling standpoint, Notes users may have more in common than the wide variety of interests and motivations that the population of Web users might have. Since all of the Notes users are working for the same business, they should have similar goals and information interests.

With in-house resources as a Notes database, Lotus provides a feature to let people annotate documents and send them to others (Ehrlich 1995). A set of links to these documents or their comments can be used like a list of URLs on a Web page, an ad hoc categorization of information. Thus, individuals can play the role of the gatekeeper and make distributing information much more dynamic.

Through InterNotes Navigator, the built-in Web browser in Notes, the set of keywords and indexes prepared for these internal documents can also be used as initial comparison criteria for the Web.

Krulwich has also developed an application for Notes that applies significant-phrase extraction and inductive learning techniques to routinely gather new documents (Krulwich 1995). Using an agent to represent an individual user's interests is the core concept of this system. These agents extract significant phrases from the documents a user interacts with and then exchange the learning results with each other anonymously. This enables users to share the significant phrases that the agents extract, without sharing their own particular feelings about the relevance of a phrase.

4.3.3 GroupLens

Another important early Collaborative Filtering system for internet information was GroupLens (Resnick et al. 1994). It provided a netnews-reading client that let readers rate each message as they read it (on a scale of 1 to 5, 5 being the best). It used a series of ratings servers (Better Bit Bureaus) to gather ratings, predict scores, and disseminate the ratings when requested. Once again, this system is built on a simple premise: "the heuristic that people who agreed in the past will probably agree again".

GroupLens improves on Tapestry by using a distributed system that enables new clients or rating servers to use evaluations in new and different ways. GroupLens also improves on its queries in that a reader does not need to know in advance whose ratings to use and in fact these ratings might even be aggregated ratings. This level of privacy seems crucial when implementing a Collaborative Filtering system outside of a company's boundaries.

Ratings are predicted for a user by correlating that user's ratings on previous articles with those of each other user (using Pearson r coefficients) to determine a weight to assign each of the other users. (Presumably, these weights are then measured and a range of neighboring users is selected.) Next, the weights are used to combine the ratings that are available for the current article. By having many different Better Bit Bureaus, each with ratings for a certain type of article (most likely organized by newsgroup), this comparison is relatively fast. Combine this with the customization that a news client has to accept ratings, and the system can provide some utility to a user.
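The two steps just described, correlating past ratings to obtain weights and then combining the available ratings for the current article, can be sketched as follows. The combination rule used here (the user's own mean plus a weighted average of the other users' deviations from their means) is one common form and should be read as an assumption, not as the published GroupLens code. The function names and sample ratings are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson r over the articles both users rated; 0.0 if undefined."""
    common = [k for k in a if k in b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[k] for k in common) / len(common)
    mean_b = sum(b[k] for k in common) / len(common)
    cov = sum((a[k] - mean_a) * (b[k] - mean_b) for k in common)
    var_a = sum((a[k] - mean_a) ** 2 for k in common)
    var_b = sum((b[k] - mean_b) ** 2 for k in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def predict(user, others, article):
    """Predict the user's rating of an article from other users' ratings."""
    user_mean = sum(user.values()) / len(user)
    num = den = 0.0
    for other in others:
        if article not in other:
            continue
        # Weight comes from agreement on previous articles only.
        past = {k: v for k, v in other.items() if k != article}
        weight = pearson(user, past)
        if weight == 0.0:
            continue
        other_mean = sum(past.values()) / len(past)
        num += weight * (other[article] - other_mean)
        den += abs(weight)
    return user_mean if den == 0 else user_mean + num / den


# Hypothetical ratings on a 1-to-5 scale.
reader = {"a1": 1, "a2": 5, "a3": 2}
ken = {"a1": 1, "a2": 5, "a3": 2, "a4": 5}   # agrees with the reader
lee = {"a1": 5, "a2": 1, "a3": 4, "a4": 2}   # consistently disagrees
prediction = predict(reader, [ken, lee], "a4")
print(round(prediction, 2))
```

Note that a user who consistently disagrees is still informative: a negative weight turns that user's low rating into evidence for a high one.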

Even though GroupLens is fairly old for a Collaborative Filtering system, its development is continuing at the University of Minnesota. Improvements have included adding capabilities for various news readers, supporting a larger (and therefore more accurate) number of Better Bit Bureaus for each type of netnews, and a current experiment, MovieLens, for rating movies and getting movie recommendations.

More important than their functionality, these early Collaborative Filtering systems have provided a firm foundation to improve upon. The statistical analysis alone that these systems worked from has improved many times over since their debuts. To increase the number of potential users, systems are undoubtedly moving to work with Web documents or at least to allow Web browsers as reader clients (as with MovieLens). This leads us into examining Web-based Collaborative Filtering systems.

4.4 Collaborative Filtering on the World Wide Web

Belkin suggests that Collaborative Filtering is a new implementation of Information Retrieval. What makes it different from Information Retrieval is that feedback is more important and aggregate feedback (how the information retrieved is used) among groups of users is measured as well (Belkin and Croft 1992).

One great feature of the Web can also be the worst enemy of Web information organizers: the Web changes so quickly that it's like trying to organize the objects flying around in a tornado. Collaborative Filtering can help us lead each other through the data. It is hard for the organizers of any Web information to know how the system will be used, and filtering may help users find the more important information more easily.

4.4.1 Influential Web Collaborative Filtering Systems

There is a whole group of tools used by Web users that support at least some of the features of a Collaborative Filtering system. Some of these, through marketing or being there first, have achieved significant popularity and support a number of users.

Mosaic

The first Web tool that facilitated collaboration was also the first graphical Web browser, Mosaic. Developed at the University of Illinois at Urbana-Champaign, this browser provided the ability to let users publish and distribute notes (including links to other Web pages) as comments added to Web pages. This simple feature (not used in the mainstream browsers by Netscape and Microsoft today) enabled users to actively share information with others. Since each comment was added manually to a Web page, this feature never caught on. However, the ability to add bookmarks (overviewed earlier in this paper and considered a valuable resource) has survived as other Web browsers have developed.

Firefly

Another more recent set of services for Collaborative Filtering on the Web is provided by the Firefly Network, Inc.

These services were initially built around the application Ringo (previously known as HOMR), which provides a "technique for making personalized recommendations from any type of database to a user based on similarities between the interest profile of that user and those of other users" (Shardanand 1995).

Ringo determines which users have similar taste via standard formulas for computing statistical correlations. It uses what the authors call social information filtering described as:

"Social Information filtering exploits similarities between the tastes of different users to recommend (or advise against) items. It relies on the fact that people's tastes are not randomly distributed: there are general trends and patterns within the taste of a person and as well as between groups of people. Social Information filtering automates a process of ´´word-of-mouth'' recommendations. A significant difference is that instead of having to ask a couple friends about a few items, a social information filtering system can consider thousands of other people, and consider thousands of different items, all happening autonomously and automatically. The basic idea is:

1. The system maintains a user profile, a record of the user's interests (positive as well as negative) in specific items.

2. It compares this profile to the profiles of other users, and weighs each profile for its degree of similarity with the user's profile. The metric used to determine similarity can vary.

3. Finally, it considers a set of the most similar profiles, and uses information contained in them to recommend (or advise against) items to the user."

These profiles are compared using a "constrained Pearson r algorithm" (similar, I assume, to GroupLens, since neither system's formulas are shown) to make the best predictions between users.
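Since the formulas are not shown in the sources reviewed here, the following Python sketch is only one plausible reading of a constrained Pearson r comparison. It assumes ratings on a seven-point scale and measures each rating's deviation from the scale midpoint (4) rather than from the user's mean. The names `constrained_pearson` and `predict`, the dictionary-based profiles, and the 0.5 similarity threshold are illustrative assumptions, not details taken from Ringo or Firefly.

```python
from math import sqrt

MIDPOINT = 4  # assumed neutral point of a 1-7 rating scale


def constrained_pearson(a, b, midpoint=MIDPOINT):
    """Correlate two rating profiles over the items both users rated.

    Profiles are dicts mapping item -> rating. Deviations are taken from
    the scale midpoint, so two users agree only when both like or both
    dislike an item.
    """
    shared = set(a) & set(b)
    num = sum((a[i] - midpoint) * (b[i] - midpoint) for i in shared)
    den = sqrt(sum((a[i] - midpoint) ** 2 for i in shared)
               * sum((b[i] - midpoint) ** 2 for i in shared))
    return num / den if den else 0.0


def predict(user, others, item, threshold=0.5):
    """Similarity-weighted average of similar users' ratings for one item.

    Returns None when no sufficiently similar user has rated the item.
    """
    weighted, total = 0.0, 0.0
    for other in others:
        if item not in other:
            continue
        weight = constrained_pearson(user, other)
        if weight > threshold:
            weighted += weight * other[item]
            total += weight
    return weighted / total if total else None
```

Under these assumptions, a user whose ratings mirror another's across the midpoint receives a negative correlation and is simply left out of the prediction.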

Since this application, Firefly has expanded its technology from recommending music and movies to creating individual newspages, recommending books, noting software files to download, and recommending Web pages. These features are managed by a Firefly Passport, a central set of preferences managed by the Firefly Network that can be used with a Firefly application on another Web site. Firefly also has sought to establish a community of users around the idea of trusted neighbors who have the same likes as each other. The Firefly Web site provides homepages, messaging, notifications when friends are also online, and a wallet full of electronic commerce settings for buying items from the participating Web sites.

However, due to the complex nature of these topics (in either quantity or complexity of description), these newer services have not matched the breakthrough success of the first Firefly system.

Yahoo!

Services like Yahoo! are doing Collaborative Filtering of Web sites the manual way. They use a staff of catalogers who use tools to update the Yahoo! index as quickly as possible. "Every site added to Yahoo! is examined by a human being." (Callery 1996) There is also help from people at the other end of the hierarchy, as Web users are invited to submit recommended pages and make a guess as to their subject category. Because of the popularity of the open, adaptable form of the Yahoo! index, it has become the de facto standard for classification on the Internet. By accepting contributions from Web users, Yahoo! also keeps itself aligned with the shifting content of Web sites.

4.4.2 General Web Collaborative Filtering Systems

There are many other Collaborative Filtering systems in use on the Web in various stages of research or development. This section covers systems that present a different viewpoint or technique that can be added to the body of knowledge about Collaborative Filtering systems.

Siteseer

One such system, Siteseer (Rucker and Polanco 1997), uses the bookmark files a Web user creates to recommend new Web pages. Not only are the individual bookmarks used by the Siteseer system, but the organization of bookmarks within folders (if any) is also used for predicting and recommending relevant Web pages. Using bookmarks as the data source directly shows at least some level of interest in the underlying content of the Web page. The different groupings or folders of bookmarks can be considered a personal classification system for the user (as noted in Abrams 1997), which enables Siteseer to contextualize recommendations in the classes defined by the user.
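The folder-based idea can be sketched roughly as follows. This is not Siteseer's published method: Jaccard overlap is used here only as a stand-in similarity measure, and the function names and the 0.25 cutoff are illustrative assumptions. Each folder is treated as a set of URLs, other users' folders are weighted by how much they overlap with the user's own folder, and unseen URLs from overlapping folders become candidate recommendations for that folder.

```python
def folder_overlap(folder_a, folder_b):
    """Jaccard overlap between two collections of bookmarked URLs."""
    a, b = set(folder_a), set(folder_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_for_folder(my_folder, other_folders, min_overlap=0.25):
    """Suggest URLs from sufficiently similar folders, best supported first.

    A URL's score is the summed overlap of every similar folder that
    contains it; ties are broken alphabetically for stable output.
    """
    mine = set(my_folder)
    scores = {}
    for urls in other_folders:
        weight = folder_overlap(mine, urls)
        if weight < min_overlap:
            continue
        for url in set(urls) - mine:
            scores[url] = scores.get(url, 0.0) + weight
    return sorted(scores, key=lambda url: (-scores[url], url))
```

Because recommendations are computed per folder, they inherit the user's own classification: a suggestion arrives already filed under the category whose contents produced it.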

Siteseer's data source and techniques are easy to agree with as useful ways to recommend Web sites. However, I can see the process being improved by also noting other information in the bookmark file, such as the date the site was first visited and the last time it was visited. Such additional information about a user's Web browsing interests would enable systems like Siteseer to focus on more relevant Web sites and help prioritize the filtering process.

PHOAKS

A new Collaborative Filtering system attracting notice and use on the Web is the PHOAKS (People Helping One Another Know Stuff) system developed by Teveen et al. (Teveen et al. 1997). They take an interesting approach that uses newsgroup messages that recommend Web resources. PHOAKS works by automatically recognizing Web resource references in a newsgroup message, attempting to classify each one (significant indicators are the subject of the message and the newsgroup it is posted in), and then introducing it into the tallying process. Various rankings (top 20 mentioned, etc.) are then distributed.
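The recognize-and-tally steps can be illustrated with a short Python sketch. It is not the PHOAKS implementation: the message fields (`newsgroup`, `sender`, `body`), the URL pattern, and the rule of counting each poster at most once per URL are assumptions made for the example, and the classification step is reduced to grouping by newsgroup.

```python
import re
from collections import Counter

# Rough pattern for http(s) URLs embedded in message text (an assumption;
# real messages would need more careful handling).
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def tally_mentions(messages):
    """Count, per newsgroup, how many distinct posters mention each URL."""
    seen = set()
    tallies = {}
    for msg in messages:
        for url in URL_PATTERN.findall(msg["body"]):
            url = url.rstrip(".,;:")  # drop trailing sentence punctuation
            key = (msg["newsgroup"], url, msg["sender"])
            if key in seen:
                continue  # the same poster repeating a URL counts once
            seen.add(key)
            tallies.setdefault(msg["newsgroup"], Counter())[url] += 1
    return tallies


def top_mentioned(tallies, newsgroup, n=20):
    """Return the n most-mentioned URLs for one newsgroup."""
    return tallies.get(newsgroup, Counter()).most_common(n)
```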

Arguably, getting recommendations from another Internet source is a good idea, as Web resources can be rapidly gathered from a pool of abundant, up-to-date messages. However, to achieve any kind of accuracy, some "kill" list of message titles, senders, and Web resources mentioned in signatures must be used to eliminate netnews noise or over-zealous advertising.

Fab

Marko Balabanovic (one of the most prolific Collaborative Filtering system developers) and Yoav Shoham at Stanford developed Fab (Balabanovic and Shoham 1997), one of the Web's most popular Collaborative Filtering systems. By combining both collaborative and content-based filtering systems, Fab seeks to eliminate some of the weaknesses found in each type of system.

Implementing content-based filtering to locate similar Web pages with collaborative-based filtering to note Web pages a user liked can overcome many problems such as the difficulty of obtaining recommendations on new Web pages and changing user interests. Fab updates a user's profile based on the content filtering and uses a seven-point scale to enable a user to further rank the recommended Web pages. In dealing with the problem of the "cold-start" of Collaborative Filtering systems where it takes a user significant time to rank enough resources to prove useful, the content-based construction of the profiles is important to Fab's success.

Fab uses a series of programs (which are called "agents") that collect Web pages for a topic, select pages for a particular user, search the Web for pages that match a particular profile type, and index by querying Web search engines and personally organized sites.
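To make the hybrid idea concrete, here is a minimal Python sketch under stated assumptions; it is not Fab's actual design. The profile is a bag of weighted terms, content matching uses cosine similarity over raw word counts, a rating on the seven-point scale shifts the profile by its distance from an assumed midpoint of 4, and the collaborative side is reduced to a list of pages that similar users rated highly. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def term_vector(text):
    """Very rough content representation: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two term->weight mappings."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def update_profile(profile, page_text, rating, midpoint=4):
    """Shift the profile toward (or away from) the terms of a rated page."""
    for term, count in term_vector(page_text).items():
        profile[term] = profile.get(term, 0.0) + (rating - midpoint) * count
    return profile


def hybrid_candidates(profile, pages, liked_by_similar_users, top_n=3):
    """Best content matches first, then pages similar users rated highly.

    `pages` maps URL -> page text; `liked_by_similar_users` is a list of
    URLs supplied by the collaborative side.
    """
    ranked = sorted(pages, key=lambda url: -cosine(profile, term_vector(pages[url])))
    content_picks = ranked[:top_n]
    extra = [url for url in liked_by_similar_users if url not in content_picks]
    return content_picks + extra
```

In this sketch a brand-new page can surface through the content side before anyone has rated it, while the collaborative list can surface pages whose text the profile would never have matched.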

The strength of hybrid systems like Fab will undoubtedly influence future Collaborative Filtering systems. Not only does Fab strike a balance that makes it useful, it also minimizes its use of Web resources by not flooding the Web with masses of queries, and it uses "off-the-shelf", existing Web resources to help organize and locate Web pages.

Grassroots

Kamiya, Röscheisen, and Winograd (Kamiya, Röscheisen, and Winograd 1996), again at Stanford, developed Grassroots, a system that makes Collaborative Filtering only a part of a more ambitious system. They describe Grassroots as "A System Providing A Uniform Framework for Communicating, Structuring, Sharing Information, and Organizing People".

Their argument is that a more holistic system is needed to truly capture all of a user's information filtering needs.

"People keep pieces of information in diverse collections such as folders, hotlists, e-mail inboxes, newsgroups, and mailing lists. These collections mediate various types of collaborations including communicating, structuring, sharing information, and organizing people. Grassroots is a system that provides a uniform framework to support people's collaborative activities mediated by collections of information. The system seamlessly integrates functionalities currently found in such disparate systems as e-mail, newsgroups, shared hotlists, hierarchical indexes, hypermail, etc."

Grassroots provides a uniform interface of Web pages to access all of the information it works with. Realistically, Grassroots also lets participants continue using other mechanisms, and Grassroots takes as much advantage of them as possible. The main engine behind Grassroots is a Web server and Proxy server setup that can be used with any Web browser.

Grassroots features a news inbox that supports read, write, and post protection and authentication. This enables a form of decentralized, ad hoc newsgroups or mailing lists that can "compete" for a user's attention. With tracking via the servers, the popular groups or lists with a lot of randomness can be noted by the user.

Integrated or total information management systems like Grassroots are bound to be experimented with, especially in corporate or intranet environments. Some of these concepts are already used in systems like Lotus Notes. Grassroots' advantages are its Web-based interface, Web server control mechanisms, and adaptability in forming and changing groupings of information and user participation--all trends future Collaborative Filtering systems will follow.

5. Conclusion

This section concludes the research overview and summarizes the general ideas in the paper.

The previous sections of this paper have covered a wide variety of topics:

  1. Information Seeking models and studies that provide insight into how people browse information resources,
  2. Bibliometrics, a set of techniques to analyze information referencing and use,
  3. The Internet and World Wide Web, as a set of protocols and applications that give access to the largest information system in the world, and
  4. Filtering and Collaborative Filtering, approaches and studies that attempt to prioritize and recommend sources for Information Seeking.

This section attempts to show that combining the knowledge of these areas can provide the tools necessary to augment an individual's Information Seeking on the World Wide Web.

Every day, the Internet proves that the whole is greater than the sum of its parts. Millions of users participate in the World Wide Web by publishing, reading, searching, learning, recommending, and directly interacting with each other--forming a community. If only the smallest fraction of this variety and volume of behavior can be utilized, it can transform the Web into an even more personal and participatory environment.

Douglas Engelbart began the work to augment man's intellect through the use of artifacts, language, and training (Engelbart 1963). With the World Wide Web, we finally have information processing artifacts that can actively tell us about their use. Engelbart proposes a "systems-engineering approach" to design artifacts and people in step with one another. By using the frameworks of study reviewed in this paper, we more readily approach this goal of designing systems in step with people.

These ideas emerge from this survey of the literature:

  1. Research in Information Seeking supports that Information Seeking on the Web may be understood as a finite number of activities and phases, providing a useful framework for designing data collection and analyzing data from empirical studies.
  2. Research in Bibliometrics provides us with a set of laws and techniques applicable to Information Seeking usage data in order to uncover underlying patterns and clustering behaviors.
  3. The browsing and serving of Web pages generate a variety of data and information resources such as bookmark files, history files, and server logs. These resources can be individually and collectively analyzed or processed to provide insights about the information needs and preferences of a group of users.
  4. Research in Collaborative Filtering suggests a number of information filtering strategies that allow a group of users with related information needs to collectively augment their information seeking. In effect, Collaborative Filtering enables qualitative, human-derived insight to be obtained from quantitative, machine-generated data.

In conclusion, this paper suggests that a promising avenue for augmenting Information Seeking on the World Wide Web is presented by using the results and tools from research in online information seeking, Web-based Bibliometrics, Web browsing and serving, and Collaborative Filtering.

6. Suggested Research Projects

This section proposes two research projects, both designed to answer questions about improving Information Seeking on the World Wide Web.

I will attempt to use "off-the-shelf" tools, resources, and data types as much as possible to allow more focus on analysis and improving models of Information Seeking behavior on the Web.

6.1 Research Problem A

There is a need for tools that can adapt to users' needs and preferences and help them cope with the volume of information on the World Wide Web by providing new, organized, and relevant Web resources that make Information Seeking tasks easier.


As Jakob Nielsen succinctly points out: "Web surfing is dead" (Nielsen 1997). Once the initial thrill of surfing the Web is over, most users settle down to relying on a small number of sites for information. Of course, they occasionally look at other Web sites, but for most Web users, the days of exploration on the Web are soon over.

One contributing factor to the small number of Web sites a user frequently visits is that the bookmark functions of most browsers are inadequate. Tauscher found that Web users usually only begin to organize their bookmarks when they can no longer see all of them on the screen (Tauscher and Greenberg 1997). When users do organize their bookmarks, they often still have trouble recalling what a bookmark is actually describing or their folders (sub-menus) are not labeled well. With the advent of frames and dynamically-generated Web pages, some bookmarks may not even refer to the information the user intended to remind herself of.

We know from Information Seeking research that users need to distinguish and monitor information. Bookmarks can help provide these features when using the World Wide Web as an information source. Based on these ideas, specific focus on making these activities easier should be investigated.

Clearly, new ideas in acquiring, sorting, archiving, and naming bookmarks are needed as well. Yet another area rich in potential is the direct sharing of bookmarks between users--a conceptual pool of bookmarks available to a group of Web users (a research group or entire corporate Intranet of users).

6.1.2 Research Questions

Could adaptable bookmarking tools improve Web Information Seeking? How would the bookmark files of many Web users be gathered and shared? What kind of technology is best to store, organize, and analyze bookmarks? How can a user's privacy be protected when sharing bookmarks with others?

6.1.3 Discussion

Most Web browser users do not pay much attention to the data that reside in their bookmark files. Most Web browser users also do not use the capabilities of their Web browser when dealing with bookmarks (e.g. most Web browsers released in the last year provide a function that can check to see if a bookmarked page has been updated since the last visit). A study of the needed technology to make bookmarking more useful in Information Seeking is in order.

Data collected from a Web browsing session are difficult to retrieve over a network if they are client-based (i.e. not from a Web server). Bookmark and history files can provide much of this information. A service that lets Web users share each other's bookmarks could be implemented by requiring that participants upload their bookmark and history files. Suggested methods to acquire these files include a new feature supported by some Web servers: client-side upload. Emailing the files to the service administrator is also an option.

We learned from Abrams' (Abrams 1997) study of bookmarking that users utilize bookmarks to make sense of the complexity of the Web. A more useful bookmarking tool should therefore also help a Web user comprehend the structure of the Web (or at least certain Web sites) and possibly even improve his mental model of the Web.

At a technical implementation level, it would be possible to implement a series of methods that use existing technology on the Web to help classify bookmarks. For example, it might be helpful to automatically organize a user's bookmarks based on the Yahoo! classification system. Even more importantly, these techniques can be described and serve as a foundation for other researchers to use when developing Web-centric applications.

The trend (now somewhat past) of building a list of favorite links as part of a homepage is a boon to Collaborative Filtering. By collecting bookmark files and these pages of links, a study could be undertaken to analyze the URLs and look for trends among the owners.

It may also be possible to collect all of the sources of a Web user's bookmarks and perform Bibliometric-like analysis on them to unearth even more data on similarities among Web pages and users.
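One simple form such an analysis might take is co-occurrence counting, loosely analogous to co-citation analysis in Bibliometrics: two URLs that many users bookmark together are presumed to be related. The Python sketch below is an illustration under that assumption; the function names and the minimum count of 2 are arbitrary choices.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(bookmark_sets):
    """Count how many users bookmarked each unordered pair of URLs."""
    pairs = Counter()
    for urls in bookmark_sets:
        # Sorting gives every pair one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(urls)), 2):
            pairs[(a, b)] += 1
    return pairs


def related_to(url, pairs, min_count=2):
    """URLs bookmarked together with `url` by at least `min_count` users."""
    related = []
    for (a, b), count in pairs.items():
        if count >= min_count and url in (a, b):
            related.append((b if a == url else a, count))
    return sorted(related, key=lambda item: (-item[1], item[0]))
```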

6.1.4 Method

The design of a new bookmarking tool would require a review of the available tools and techniques used to work with bookmarks, and possibly other Web resources. A historical review of hypertext systems with similar functionality, much like Lee's review of history tools (Lee 1992), could be undertaken in order to provide a listing of the features present in current bookmarking tools. This information could then be compared to the features actually used, which should suggest new techniques or ways to make current techniques more usable.

A method for collection of bookmark and history files could be to use new technical capabilities of Internet applications (such as client-side upload or JavaScript functions with browser cookie files) to enable a server to automatically request these files from the Web browser.

Techniques from Collaborative Filtering could be used to compare bookmark data. One technique, adapted from the Fab system, would be to use a utility to extract bookmark URLs from the bookmark file and then use existing search engines (to find different users' pages of bookmarks) and Yahoo! (to discover a bookmark's classification) to look for other sites that also reference the URL. Bibliometric studies suggest that these sites might also have other information of interest to the user who bookmarked the original page.
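The extraction utility could be quite small. The sketch below uses Python's standard `html.parser` module and assumes a Netscape-style bookmark file, in which folders appear as `<H3>` titles followed by a `<DL>` list and each bookmark is an `<A HREF=...>` element; the `ADD_DATE` and `LAST_VISIT` attribute names are assumptions about that format and may differ between browsers and versions. `BookmarkParser` and `parse_bookmarks` are hypothetical names.

```python
from html.parser import HTMLParser


def _to_int(value):
    """Convert a numeric attribute string to int, or None if absent."""
    return int(value) if value and value.isdigit() else None


class BookmarkParser(HTMLParser):
    """Collect URLs, folder paths, and timestamps from a bookmark file."""

    def __init__(self):
        super().__init__()
        self.folders = []    # stack of currently open folder names
        self.pending = None  # folder title read but not yet opened by <DL>
        self.in_h3 = False
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # HTMLParser lowercases tag and attribute names
        if tag == "h3":
            self.in_h3, self.pending = True, ""
        elif tag == "dl":
            # A <DL> opens the folder named by the preceding <H3>, if any.
            self.folders.append((self.pending or "").strip())
            self.pending = None
        elif tag == "a" and "href" in attrs:
            self.bookmarks.append({
                "url": attrs["href"],
                "folder": "/".join(name for name in self.folders if name),
                "added": _to_int(attrs.get("add_date")),
                "last_visit": _to_int(attrs.get("last_visit")),
            })

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
        elif tag == "dl" and self.folders:
            self.folders.pop()

    def handle_data(self, data):
        if self.in_h3:
            self.pending += data


def parse_bookmarks(html_text):
    """Return a list of bookmark records found in the given file text."""
    parser = BookmarkParser()
    parser.feed(html_text)
    parser.close()
    return parser.bookmarks
```

Keeping the folder path and timestamps alongside each URL preserves the user's own classification and the recency information for later analysis.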

Bookmarks could also be compared with history files to determine which sites are visited most often and to establish new categories based on these visits, much as the Tapestry system uses endorsements to indicate interest. Essentially, the history file shows how much interest there is in a bookmark (by showing the number of visits), and frequently-visited sites can then be noted as having much interest.
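As a rough illustration, the comparison amounts to counting history entries per bookmarked URL. The Python sketch below assumes the history file has already been reduced to a list with one URL per visit; that representation and the function name are assumptions, since real history files differ between browsers.

```python
from collections import Counter


def rank_bookmarks_by_visits(bookmarks, history):
    """Pair each bookmarked URL with its visit count, most visited first.

    `bookmarks` is an iterable of URLs; `history` lists one URL per visit.
    Bookmarks that never appear in the history get a count of zero.
    """
    visits = Counter(history)
    ranked = [(url, visits[url]) for url in bookmarks]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))
```

Bookmarks at the bottom of the ranking are as informative as those at the top: a bookmark that is never revisited may mark a lapsed interest.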

Developing any new application would require usability testing. For usability testing, the sample sizes and techniques advocated by Jakob Nielsen's "Discount Usability Engineering" (Nielsen 1989) are suggested. Nielsen's methodology uses a small number of carefully-selected, representative users who provide the maximum breadth of user experience. They are observed using the application, followed by subsequent interviews to review the session to discover any un-expressed impressions about the application being tested.

Another approach that could be used for interviewing is the critical incident method, where the user is asked to describe in detail key moments in the process. Using a log file that gathers a user's actions while using the system, the user could then be walked through her session and asked to explain what happened at certain points in the testing process.

6.1.5 Expectations

With reference to user satisfaction with a new type of bookmarking tool, it is expected that acceptance would be high due to the need for such a tool. However, if the tool is not easily available or is hard to coordinate with the user's regular Web browser, long term use is not expected. This makes the selection of the development platform and implementation more important than initially thought. At worst, good user response to the expanded feature set of a new bookmark management tool could provide useful functionality to build into a more robust application.

It is also expected that a new appreciation for bookmark files as a rich source of data describing Web Information Seeking behavior would develop among the users of the system, this author, and the research community at large.

6.1.6 Significance of the Research

There are relatively few robust tools that work with bookmarks. Even the commercially available tools seem to miss the richness of data a bookmark file provides. Focusing on bookmarks in a study like this should improve the use of any bookmarking tool and establish bookmarks as good sources of data about a user's Information Seeking process.

Aside from the added functions of a bookmarking tool, some practical data can be extracted from the process on how to make design decisions in implementing a tool that is easy to access (its availability as a client or server tool, online or offline tool), uses an open set of technical standards (Java-based or a customized Web site/page for each user), and has a suitable underlying data format (use the existing bookmark format or expand it).

During the study of creating and testing such a tool, new information about the specific Information Seeking behavior of World Wide Web users could also form the basis for future Information Seeking research.

6.2 Research Problem B

The problem is to design a Collaborative Filtering application that uses a community of Web users' input to recommend new Web sites.

6.2.1 Introduction

Finding information on the World Wide Web is like searching through a library with only a few books having titles on their spines--most of the information available is difficult to uncover. We know from Information Seeking models that a user needs help in surveying and browsing through information. Therefore, guides (much like the informal networks studied in Environmental Scanning) are needed that can recommend titles based on our expressed and implied interests and that can provide relationships between titles.

Many of the Collaborative Filtering systems reviewed in the paper propose useful techniques to build, compare, distribute, and maintain filtering recommendations for a group of participating users. However, in almost all cases, the systems are either proprietary, depend on a pre-selected group of users (a corporation's users, for example), or require an initial period where the suggestions are either ranked or gradually tailored to the user. All three of these problems contribute to making Collaborative Filtering systems either less than useful considering the tradeoffs or not used at all after a brief honeymoon with the system.

A public and therefore more easily received and used Collaborative Filtering system would need to capitalize on existing Web technologies and take advantage of Web standards (such as P3P and the Open Profiling Standard) to meet a user's privacy concerns. Using and setting the PICS (the Platform for Internet Content Selection) information about a site could also provide some assurance of privacy, making a user more comfortable providing her data to the Collaborative Filtering system.

6.2.2 Research Questions

Can we circumvent the cold start problem of Collaborative Filtering systems by using a Web user's initial set of bookmarks to recommend Web sites? Is a set of default user profiles needed to begin recommending Web sites to a user? What server capabilities would be useful in gathering data to build a user profile?

6.2.3 Discussion

Since I do not want to explore algorithms to unearth results, I plan to use as many existing Internet indexing resources as possible. Internet search engines, the Yahoo! Web index, and existing relational database technology (to store data and build reports that perform calculations on the data tables) will be used whenever possible--I do not want to re-invent the Collaborative Filtering wheel.

One area where little work has been done is the interface design and metaphors used in Collaborative Filtering systems. It might be very useful to make this a significant part of the research project, not only to advance collaborative interface and metaphor development, but also to discover if these types of "non-technical" improvements can make a system more useful or more frequently used by a user population. For example, interface controls and screen metaphors could be designed like the image of a big-city newspaper. Each user is a reporter or editor specializing in a newsbeat, or even simply a user "reading the paper" full of customized links to Web resources. Individuals belong to an aggregate group(s) (reporters on the "technology beat" or "sports fans") and can work in those roles while using the associated type of information. This is similar to viewing publishing as a distributed process, characterized by the cooperation of different experts (Aberer, Boehm, and Hueser 1994). In this case, the experts are not publishers, but the content experts themselves. Users self-edit and self-select the information they want.

Additionally, some of the methods and techniques described in the review of Collaborative Filtering systems can be used in innovative ways to improve recommendation accuracy. For example, recommendations could be refined by using Letizia's reading method of downgrading links after a user has "read past" them, or by using the HTTP REFERER log option to find out whether a user visited only the recommended page or went on to other pages after reaching the initial destination.

6.2.4 Method

Data collection could begin with techniques similar to those presented in Research Problem A. Bookmarks and history files could be collected via client-side upload or by email. This project would require the use of a dedicated server, which makes gathering data via a JavaScript and Java application possible. The application could be downloaded to the user's Web browser and then ask the user's permission to upload her history and bookmarks files. The locations of the files are known to the browser application and can be gathered from its preferences file. The user could be shown the location of the files to be collected and then confirm their being sent.

In addition to using the server to retrieve files, data in server logs could also be collected as the server could be configured to be the gateway to Web sites recommended to the user. Log files can indicate further interest in the web page by the user: if the user follows the link to the site and then immediately returns to the previous page, the site was most likely not as useful in comparison to visiting the page and following links from it to other pages.
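The distinction between returning immediately and following links onward can be approximated from the log's referrer field. The Python sketch below assumes log lines in the common "combined" format, with the full URL in the request line (as a proxy or gateway would record it), and identifies a user by the requesting host; all three are simplifying assumptions, as are the labels "explored" and "bounced". A visit to a recommended page counts as explored if the same host later requested any page naming the recommended page as its referrer.

```python
import re

# host ident user [time] "METHOD url PROTOCOL" status bytes "referrer" ...
LOG_LINE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<url>\S+)[^"]*" \d{3} \S+ "(?P<referer>[^"]*)"'
)


def parse(lines):
    """Yield (host, url, referrer) for each line that matches the format."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            yield match.group("host"), match.group("url"), match.group("referer")


def classify_visits(lines, recommended):
    """Label each visit to a recommended URL as 'explored' or 'bounced'."""
    entries = list(parse(lines))
    # Every (host, page) pair from which that host followed a link.
    followed = {(host, referer) for host, _, referer in entries}
    result = {}
    for host, url, _ in entries:
        if url in recommended:
            result[(host, url)] = "explored" if (host, url) in followed else "bounced"
    return result
```

A fuller version would also compare timestamps, since a long dwell time on a single page may signal interest even when no link is followed.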

If the metaphor of the newspaper is followed, users could be asked to "subscribe" to this type of service and by using the Collaborative Filtering server's proxy abilities provide even more user data about a user's Web browsing behavior.

Hartson, Castillo, Kelso, and Neale propose using distance evaluation in their paper, "Remote Evaluation: The Network as an Extension of the Usability Laboratory" (Hartson et al. 1996). They argue that since "the network itself and the remote work setting have become intrinsic parts of usage patterns, difficult to produce in a laboratory setting," remote evaluation is possible with the proper design. They also endorse the method of critical incident gathering to support the data collected remotely. With the advent of the various data collection methods described in this paper, this methodology has great potential for evaluating Web-oriented systems. I would like to pursue these methods to gather additional data about the sites a Web user visits.

To get users to utilize the service, I will borrow another set of recently developed techniques to attract users. Kotelly shows some of these new techniques in his overview of the "World Wide Web Usability Tester, Collector, Recruiter" (Kotelly 1997).

"The usability team at Wildfire Communications, Inc. conducted a usability test using the World Wide Web (WWW) as a method to advertise the test, recruit participants and gather data -- all automatically. The test was conducted over the course of only 2 days during which we collected useful information from 96 people. The usability test was for a speech system using participants recruited by Internet Newsgroups, e-mail lists and the WWW. Using these resources helped us to get a large population to test the system in a short period of time."

For Internet-based applications, it seems only natural to test using the Internet as much as possible. It may be possible to download a Java applet or application (much like the application developed for the Environmental Scanning Project Dr. Choo, Brian Detlor, Ross Barclay, and myself undertook this summer) that can record data on the use of the system. This is in addition to the methods of collecting usage data already discussed in this paper (history logs and server log information). It might even be possible to coordinate testing with CU-SeeMe applications that might capture a user's facial reactions in any given situation.

Once user participation is ensured and data files are collected, the data would be stored in a set of database tables. Reporting features of the database can be used to perform calculations similar to the Bibliometric techniques reviewed in this paper to locate patterns in Web site use among groups of users. This technique could also be used to build a set of user profiles that could serve as templates for new users to get them using the system quickly. Using database "triggers" that act when certain conditions are met in the database, the filtering can periodically be updated as new data is gathered.
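A small sketch of the table-and-trigger arrangement follows, using SQLite through Python's standard `sqlite3` module purely for illustration; the proposal does not name a database product, and the table, trigger, and function names here are hypothetical. Each newly stored bookmark fires a trigger that keeps a running count of how many users have bookmarked that URL, so a report of widely shared sites stays current as data arrives.

```python
import sqlite3

SCHEMA = """
CREATE TABLE bookmarks (
    user_id TEXT NOT NULL,
    url     TEXT NOT NULL,
    PRIMARY KEY (user_id, url)
);
CREATE TABLE site_counts (
    url     TEXT PRIMARY KEY,
    n_users INTEGER NOT NULL
);
-- Runs once for every bookmark row actually inserted.
CREATE TRIGGER count_bookmark AFTER INSERT ON bookmarks
BEGIN
    INSERT OR IGNORE INTO site_counts (url, n_users) VALUES (NEW.url, 0);
    UPDATE site_counts SET n_users = n_users + 1 WHERE url = NEW.url;
END;
"""


def open_store():
    """Create an in-memory database with the tables and trigger above."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_bookmarks(conn, user_id, urls):
    """Store a user's bookmarks; repeated (user, url) pairs are ignored."""
    conn.executemany(
        "INSERT OR IGNORE INTO bookmarks (user_id, url) VALUES (?, ?)",
        [(user_id, url) for url in urls],
    )
    conn.commit()


def shared_sites(conn, min_users=2):
    """Report URLs bookmarked by at least `min_users` different users."""
    rows = conn.execute(
        "SELECT url, n_users FROM site_counts "
        "WHERE n_users >= ? ORDER BY n_users DESC, url",
        (min_users,),
    )
    return rows.fetchall()
```

For example, after loading two users' files, `shared_sites(conn)` lists only the URLs they have in common, without any separate batch recalculation.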

6.2.5 Expectations

With an adequate sample size and significant use among the participants, findings should indicate that Collaborative Filtering is applicable for more complex subjects than the relatively static domains of movies and music. Using the standard structure and protocols of the World Wide Web to gather and produce data on Web users' Information Seeking behavior may influence future standards for Web data collection and analysis.

Specific results should prove that users are amenable to recommendations about additional Web sites to visit. This introduction of new Web sites may also possibly encourage more exploration of the available resources on the Web. Users also might become more inclined to adopt a gatekeeper role for their collaborative community, increasing the data that can be collected. The relative ease of recommending and visiting Web sites should also increase the participation of users more than what existing newsgroups or mailing lists that recommend resources currently evoke.

6.2.6 Significance of the Research

As the Web is ideal for developing and deploying applications (Lee and Girgensohn 1997), I will attempt to use as many open World Wide Web standards and resources as possible to collect, analyze, and deliver information via the World Wide Web. The level of success of this study could then influence future use of Web-based systems for Collaborative Filtering.

By using Information Seeking theories to understand and categorize Web browser use and Bibliometric techniques to then provide insights into patterns of use, these areas of study might become more popular in designing Collaborative Filtering systems.

The results of the study might also prove, again, that client-based data files are truly useful when attempting to uncover Information Seeking behavior on the Web and use it to predict additional information resources.

7. References

Aberer, Karl, Klemens Boehm, and Christoph Hueser. 1994. The Prospects of Publishing Using Advanced Database Concepts. Conference on Electronic Publishing, Document Manipulation and Typography, at Darmstadt, Germany.

Abrams, David and Baecker, Ron. 1997. How People Use WWW Bookmarks. SIGCHI, at Atlanta, Georgia.

Abrams, Marc and Williams, Stephen. 1996. Complementing Surveying and Demographics with Automated Network Monitoring. World Wide Web Journal 1 (3):101-120.

Agapow, Paul-Michael. 1994. Bootstrapping Evolution with Extra-Somatic Information. In Complex Systems: Mechanisms of Adaptation, edited by R. J. S. a. X. H. Yu. Amsterdam: IOS Press.

Aguilar, Francis J. 1967. Scanning the Business Environment. New York: Macmillan Co.

Allen, T. J. 1977. Information needs and uses. In Annual Review of Information Science and Technology. Chicago: Encyclopedia Britannica.

Arthur, W. Brian. 1988. Self-Reinforcing Mechanisms in Economics. In The Economy as an Evolving Complex System, SFI Studies in the Sciences of Complexity: Addison-Wesley.

Belkin, Nicholas J., and W. Bruce Croft. 1992. Information Filtering and Information Retrieval: Two Sides of the Same Coin? Communications of the ACM 35 (12):29-38.

Belkin, Nicholas J., Pier Giorgio Marchetti, and Colleen Cool. 1993. BRAQUE: Design of an Interface to Support User Interaction in Information Retrieval. Information Processing and Management 29 (3):325-344.

Belkin, N. J., R. N. Oddy, and H. M. Brooks. 1982. ASK for Information Retrieval: Part I. Background and history. Journal of Documentation 38 (2):61-71.

Berners-Lee, T., et al. 1992. World-Wide Web: The information universe. Electronic Networking: Research, Applications and Policy 2 (1):52-58.

Berners-Lee, T. 1994. The World-Wide Web. Communications of the ACM 37 (8):76-82.

Brookes, B. C. 1973. Numerical Methods of Bibliographic Analysis. Library Trends:18-43.

Callery, Anne. 1996. Yahoo! Cataloging the Web. Untangling the Web, April 26, at University of California, Santa Barbara.

Catledge, Lara D., and James E. Pitkow. 1995. Characterizing Browsing Strategies in the World-Wide Web. Computer Networks and ISDN Systems 27:1065-1073.

Cherry, J. M. 1997. Feedback on "Bibliometrics and the World-Wide Web": a paper for Advanced Seminar in Research Methodologies, LIS3005Y.

Choo, Chun Wei, and Ethel Auster, eds. 1993. Environmental Scanning: Acquisition and Use of Information by Managers. Edited by M. E. Williams. Vol. 28, Annual Review of Information Science and Technology. Medford, NJ: Learned Information, Inc.

Cook, S., and G. Birch. 1991. Modelling groupware in the electronic office. International Journal of Man Machine Studies 34 (3):369-394.

Cooper, I. H. 1994. WWW Logfile Analysis Project: http://althea.ukc.ac.uk/Projects/logfiles.

Csikszentmihalyi, Mihaly. 1991. Literacy and Intrinsic Motivation. In Literacy: An overview by fourteen experts, edited by S. R. Graubard. New York: Cambridge University Press.

Diodato, Virgil. 1994. Dictionary of Bibliometrics. New York: The Haworth Press.

Downie, Stephen J. 1996. Informetrics and the World Wide Web: a case study and discussion. Canadian Association for Information Science, June 2-3, at University of Toronto.

Drott, M. C. 1981. Bradford's Law: Theory, Empiricism and the Gaps Between. Library Trends Summer (Special Issue on Bibliometrics):41-52.

Ehrlich, K., and D. Cash. 1994. Turning Information into Knowledge: Information Finding as a Collaborative Activity. Digital Libraries '94, at College Station.

Ehrlich, Kate; Lee, Alex; Ritari, Philio. 1995. Hypertext Links In Lotus Notes And The World Wide Web. Cambridge: Lotus Development Corporation.

Ellis, David. 1989. A Behavioural Approach to Information Retrieval Systems Design. Journal of Documentation 45 (3).

Ellis, David. 1991. New horizons in information retrieval. London: Library Association.

Ellis, David. 1996. Progress and Problems in Information Retrieval. London: Library Association.

Ellis, David. 1997. Modelling the information seeking patterns of engineers and research scientists in an industrial environment. Journal of Documentation 53 (4):384-403.

Engelbart, D. 1963. A conceptual framework for the augmentation of man's intellect. In Vistas in Information Handling, edited by P. Howerton. Washington, DC: Spartan Books.

Fidel, Raya. 1984. Online searching styles: a case-based model of searching behavior. Journal of American Information Science 35 (4):211-221.

Fischer, Gerhard, and Curt Stevens. 1991. Information Access in Complex, Poorly Structured Information Spaces. SIGCHI.

Gaines, Lee Li-Jen and Briand R. 1997. Knowledge Acquisition Processes in Internet Communities. In http://ksi.cpsc.ucalgary.ca/KAW/KAW96/chen/ka-chen-gaines.html. Alberta: Knowledge Science Institute, University of Calgary.

Goldberg, David, David Nichols, Brian M. Oki, and Douglas Terry. 1992. Using Collaborative Filtering to Weave an Information Tapestry. Communications of the ACM 35 (12):61-70.

Hartson, H. Rex, Jose C. Castillo, John Kelso, and Wayne C. Neale. 1996. Remote Evaluation: The Network as an Extension of the Usability Laboratory. CHI '96, at Vancouver.

Hill, Will, Larry Stead, Mark Rosenstein, and George Furnas. 1995. Recommending and Evaluating Choices In A Virtual Community of Use. CHI '95, at Denver.

Huberman, Bernardo A., Peter L.T. Pirolli, James E. Pitkow, and Rajan Lukose. 1997. Strong Regularities in World-Wide Web Surfing: currently unpublished, given by author Pitkow.

Kamiya, Kenichi, Martin Röscheisen, and Terry Winograd. 1996. Grassroots: A System Providing A Uniform Framework for Communicating, Structuring, Sharing Information, and Organizing People. Sixth International World Wide Web Conference, at Paris.

Kehoe, Colleen and Pitkow, James E. 1996. Surveying the Territory: GVU's Five WWW User Surveys. World Wide Web Journal 1 (3):77-84.

Kessler, M.M. 1963. Bibliographic coupling between scientific papers. American Documentation 14:10-25.

Kolodner, Janet. 1984. Retrieval and Organizational Strategies in Conceptual Memory: A Computer Model: Lawrence Erlbaum.

Kotelly, Christopher (Blade). 1997. World Wide Web Usability Tester, Collector, Recruiter. SIGCHI, at Atlanta, Georgia.

Krulwich, Bruce. 1995. Learning user interests across heterogeneous document databases. 100 South Wacker Drive, Chicago, IL 60606: Andersen Consulting LLP.

Kuhlthau, Carol C. 1991. Inside the Search Process: Information Seeking from the User's Perspective. Journal of the American Society of Information Science 42 (5):361-371.

Lee, Alison. 1992. Investigation into History Tools for User Support. PhD, Computer Science, University of Toronto, Toronto.

Lee, Alison, and Andreas Girgensohn. 1997. Developing Collaborative Applications Using the World Wide Web "Shell": a tutorial. SIGCHI 97, March 22-27, at Atlanta.

Lieberman, Henry. 1995. Letizia: An Agent That Assists Web Browsing. International Joint Conference on Artificial Intelligence, at Montreal.

Liebscher, Peter, and Gary Marchionini. 1988. Browser and Analytical Search Strategies in a Full-Text CD-ROM Encyclopedia. School Library Media Quarterly Summer 1988:223-233.

Liu, Cricket, Peek, Jerry, Jones, Russ, Buus, Bryan, Nye, Adrian. 1994. Managing Internet Information Services, A Nutshell Handbook. Sebastopol: O'Reilly & Associates, Inc.

Maarek, Yoelle S. and Ben Shaul, Israel Z. 1996. Automatically Organizing Bookmarks per Contents. Fifth International World Wide Web Conference, May 6-10, at Paris, France.

Maes, Pattie. 1996. Conversation with Don Turnbull at the MIT Media Lab.

Malone, T.W., K.R. Grant, F.A. Turbak, S.A. Brobst, and M.D. Cohen. 1987. Intelligent information-sharing systems. Communications of the ACM 30 (5):390-402.

Marchionini, Gary. 1995. Information Seeking in Electronic Environments. Edited by J. Long. 10 vols. Vol. 9, Cambridge Series on Human-Computer Interaction. Cambridge: Cambridge University Press.

Marchiori, Massimo. 1997. The Quest for Correct Information on the Web: Hyper Search Engines. Sixth International World Wide Web Conference, April 7-11, at Santa Clara.

Marshakova, I.V. 1973. A system of document connection based on references. Scientific and Technical Information Serial of VINITI 6 (2):3-8.

Meadow, Charles T., Jiabin Wang, and Manal Stamboulie. 1993. An Analysis of Zipf-Mandelbrot Language Measures and Their Application to Artificial Languages. Journal of Information Science 19 (4):247-258.

Miller, Jim, Paul Resnick, and David Singer. 1996. Rating Services and Rating Systems (and Their Machine Readable Descriptions). Cambridge, MA: World Wide Web Organization.

Mogul, J., and P. J. Leach. 1997. Simple Hit-Metering for HTTP: HTTP Working Group.

Monge, Alvaro E., and Charles P. Elkan. 1996. The WEBFIND tool for finding scientific papers over the worldwide web. AAAI Fall Symposium: AI Applications in Knowledge Navigation and Retrieval, at Montreal.

Moravcsik, M.J. and Murugesan, P. 1975. Some results on the function and quality of citation. Social Studies of Science 5:86-92.

Morita, M., and Y. Shinoda. 1994. Information Filtering Based on User Behavior Analysis and Best Match Text Retrieval. 17th Annual SIGIR Conference on Research and Development.

Nardi, B.A., and J.R. Miller. 1991. Twinkling lights and nested loops: Distributed problem solving and spreadsheet development. International Journal of Man Machine Studies 34 (2):161-184.

Netscape, Communications Corporation. 1997. Using Netscape Proxy Server. Mountain View, CA: Netscape Communications Corporation.

Nielsen, Jakob. 1989. Usability engineering at a discount. In Designing and using human-computer interfaces and knowledge based systems, edited by G. Salvendy and M. J. Smith. Amsterdam: Elsevier Science Publishers.

Nielsen, Jakob. 1997. User Interface Design for the WWW. SIGCHI, at Atlanta, Georgia.

Pitkow, James. 1997. In Search of Reliable Usage Data on the WWW. Sixth International World Wide Web Conference, April 7-11, at Santa Clara.

Pitkow, James E., and Krishna A. Bharat. 1994. WebViz: A Tool for World-Wide Web Access Log Analysis. First International WWW Conference.

Pitkow, James E., and Mimi Recker. 1994a. Integrating Bottom-Up and Top-Down Analysis For Intelligent Hypertext. Atlanta: Georgia Institute of Technology.

Pitkow, James E., and Margaret M. Recker. 1994b. A Simple Yet Robust Caching Algorithm Based on Dynamic Access Patterns. Second International WWW Conference.

Price, Derek de Solla. 1976. A General Theory of Bibliometric and Other Cumulative Advantage Processes. Journal of American Society of Information Science 27 (Sept-Oct):292-306.

Resnick, Paul, Neophytos Iacovou, Mitesh Suchak, Peter Bergstrom, and John Riedl. 1994. GroupLens: An Open Architecture for Collaborative Filtering of Netnews. CSCW 94, at Chapel Hill, NC.

Resnick, Paul, and Hal R. Varian. 1997. Recommender Systems. Communications of the ACM 40 (3).

Rhodes, Bradley J., and Thad Starner. 1996. Remembrance Agent: A continuously running automated information retrieval system. The First International Conference on The Practical Application of Intelligent Agents (PAAM '96).

Rice, James, Adam Farquhar, Phillipe Piernot, and Thomas Gruber. 1996. Using the Web Instead of a Window System. CHI '96, at Vancouver.

Rucker, James, and Marcos J. Polanco. 1997. Siteseer: Personalized Navigation for the Web. Communications of the ACM 40 (3):73-75.

Saracevic, Tefko. 1996. Modeling Interactions in Information Retrieval (IR): A Review and Proposal. 59th ASIS Annual Meeting, at Baltimore.

Saracevic, Tefko, and Paul Kantor. 1988a. A Study of Information Seeking and Retrieving. II. Users, Questions, and Effectiveness. Journal of American Society for Information Science 39 (3):177-196.

Saracevic, Tefko, and Paul Kantor. 1988b. A Study of Information Seeking and Retrieving. III. Searchers, Searches, and Overlap. Journal of American Society for Information Science 39 (3):197-216.

Saracevic, Tefko, Paul Kantor, Alice Y. Chamis, and Donna Trivison. 1988. A Study of Information Seeking and Retrieving. I. Background and Methodology. Journal of American Society for Information Science 39 (3):161-176.

Schiminovich, S. 1971. Automatic Classification and Retrieval of Documents by Means of a Bibliographic Pattern Discovery Algorithm. Information Storage and Retrieval 6:417-435.

Shakes, Johnathan; Langheinrich, Marc; and Etzioni, Oren. 1996. Dynamic Reference Sifting: A Case Study in the Homepage Domain. WWW 96.

Shardanand, Upendra and Maes, Pattie. 1995. Social Information Filtering: Algorithms for Automating "Word of Mouth". ACM SIGCHI '95, at Vancouver.

Small, H. 1973. Co-Citation in the Scientific Literature: A New Measurement of the Relationship Between Two Documents. Journal of the American Society of Information Science 24 (4):265-269.

Spertus, Ellen. 1997. ParaSite: Mining Structural Information on the Web. Sixth International World Wide Web Conference, April 7-11, at Santa Clara.

Stout, Rick. 1997. Web Site Stats: Tracking hits and Analyzing Traffic. Berkeley: Osborne/McGraw-Hill.

Tauscher, Linda, and Saul Greenberg. 1997. How people revisit web pages: empirical findings and implications for the design of history systems. International Journal of Human-Computer Studies 47:97-137.

Tauscher, Linda M. 1996a. Design Guidelines for Effective WWW History Mechanisms. Microsoft Workshop, Designing for the Web: Empirical Studies, October 30, 1996, at Redmond, WA.

Tauscher, Linda M. 1996b. Supporting World-Wide Web Navigation Through History Mechanisms. ACM SIGCHI 96 Workshop Position Paper.

Terveen, Loren, Will Hill, Brian Amento, and Josh Creter. 1997. PHOAKS: A System for Sharing Recommendations. Communications of the ACM 40 (3):59-62.

Trigg, Randall. 1983. A Network-Based Approach to Text Handling for the Online Scientific Community. Ph.D., Department of Computer Science, University of Maryland, College Park, MD.

Voos, H. 1974. Lotka and Information Science. Journal of the American Society of Information Science 25:270-273.

Williamson, C., and Ben Shneiderman. 1992. The Dynamic HomeFinder: Evaluating dynamic queries in a real-estate information exploration system. SIGIR '92, at Copenhagen.

Witten, I.H., H.W. Thimbleby, G. Coulouris, and S. Greenberg. 1991. Liveware: A new approach to sharing data in social networks. International Journal of Man Machine Studies 34 (3):337-348.

Woodruff, Allison; Aoki, Paul M.; Brewer, Eric; Gauthier, Paul; Rowe, Lawrence A. 1996. An Investigation of Documents from the World Wide Web. 5th International WWW Conference (WWW5), at Paris.

Wulf, Volker. 1996. Introduction: The Autopoietic Turn in Organization Science and its Relevance for CSCW. SIGOIS Bulletin 17 (1):2-3.

Wulfekuhler, Marilyn R. and Punch, William F. 1997. Finding Salient Features for Personal Web Page Categories. Sixth International World Wide Web Conference, April 7-11, at Santa Clara.

Yan, Tak Woon, Matthew Jacobsen, Hector Garcia-Molina, and Umeshwar Dayal. 1995. From User Access Patterns to Dynamic Hypertext Linking. Fifth International World Wide Web Conference, at Paris.

Zipf, G.K. 1949. Human behavior and the principle of least effort. Cambridge, MA: Addison-Wesley.

7.1 Appendix A. Typical Web Server Log

barium.ch.man.ac.uk. - - (15/Sep/1995:13:19:43 +0100) "GET /access.log HTTP/1.0"  200  3
sodium.ch.man.ac.uk. - - (15/Sep/1995:13:19:37 +0100) "GET /wwwstat4mac/wibble.hqx HTTP/1.0"  200  3026
barium.ch.man.ac.uk. - - (15/Sep/1995:13:19:43 +0100) "GET /access.log HTTP/1.0"  200  3 - - (15/Sep/1995:13:19:51 +0100) "GET / HTTP/1.0"  200  3026
tavish.te.rl.ac.uk. - - (15/Sep/1995:13:20:02 +0100) "GET /quit.sit.hqx HTTP/1.0"  200  2941
gatekeeper.spotimage.com. - - (15/Sep/1995:13:46:25 +0100) "GET / HTTP/1.0" 200  3026
gatekeeper.spotimage.com. - - (15/Sep/1995:13:46:29 +0100) "GET /httpd.gif HTTP/1.0"  200  1457
newsserver.scs.com.sg. - - (15/Sep/1995:14:03:17 +0100) "GET Setup.HtML HTTP/1.0"  200  2412
itecs5.telecom-co.net. - - (15/Sep/1995:14:15:04 +0100) "GET Setup.HtML HTTP/1.0"  200  2412
nw0pc11.sura.net. - - (15/Sep/1995:14:22:08 +0100) "GET Setup.HtML HTTP/1.0" 200  2412
steuben.vf.mmc.com. - - (15/Sep/1995:15:15:07 +0100) "GET / HTTP/1.0"  200 3026
steuben.vf.mmc.com. - - (15/Sep/1995:15:15:23 +0100) "GET /httpd.gif HTTP/1.0" 200  1457
getafix.isgtec.com. - - (15/Sep/1995:15:16:34 +0100) "GET Setup.HtML HTTP/1.0" 200  2412
steuben.vf.mmc.com. - - (15/Sep/1995:15:17:26 +0100) "GET /server.html HTTP/1.0"  200  598 - - (15/Sep/1995:15:20:10 +0100) "GET / HTTP/1.0"  200  3026 - - (15/Sep/1995:15:20:11 +0100) "GET /httpd.gif HTTP/1.0"  200 1457
aervinge.aervinge.se. - - (15/Sep/1995:15:25:53 +0100) "GET Setup.HtML HTTP/1.0"  200  2412
swrl7.gina.slip.csu.net. - - (15/Sep/1995:15:41:29 +0100) "GET Setup.HtML HTTP/1.0"  200  2412
tscholz-mac1.peds.uiowa.edu. - - (15/Sep/1995:16:10:24 +0100) "GET Setup.HtML HTTP/1.0"  200  2412
pool3_2.odyssee.net. - - (15/Sep/1995:16:46:51 +0100) "GET Setup.HtML HTTP/1.0" 200  2412
tavish.te.rl.ac.uk. - - (15/Sep/1995:17:17:19 +0100) "GET /access.log HTTP/1.0" 200  1691
tavish.te.rl.ac.uk. - - (15/Sep/1995:17:18:27 +0100) "GET /httpd4Mac-v123d9.sit.hqx HTTP/1.0"  200  46498
cgp_ws02.cegepsth.qc.ca. - - (15/Sep/1995:17:28:03 +0100) "GET Setup.HtML HTTP/1.0"  200  2412 - - (15/Sep/1995:17:56:12 +0100) "GET / HTTP/1.0"  200  3026 - - (15/Sep/1995:17:56:15 +0100) "GET /httpd.gif HTTP/1.0"  200 1457 - - (15/Sep/1995:17:58:35 +0100) "GET /about.html HTTP/1.0"  200 23297
olmstd2110.ucr.edu. - - (15/Sep/1995:18:09:51 +0100) "GET Setup.HtML HTTP/1.0" 200  2412 - - (15/Sep/1995:18:29:19 +0100) "GET / HTTP/1.0"  200  3026 - - (15/Sep/1995:18:29:25 +0100) "GET /httpd.gif HTTP/1.0"  200 1457 - - (15/Sep/1995:18:30:48 +0100) "GET /about.html HTTP/1.0"  200 23297
dialup97-007.swipnet.se. - - (15/Sep/1995:18:31:55 +0100) "GET Setup.HtML HTTP/1.0"  200  2412
olmstd2110.ucr.edu. - - (15/Sep/1995:18:38:13 +0100) "GET Setup.HtML HTTP/1.0" 200  2412
olmstd2110.ucr.edu. - - (15/Sep/1995:18:40:03 +0100) "GET Setup.HtML HTTP/1.0" 200  2412
olmstd2110.ucr.edu. - - (15/Sep/1995:18:41:14 +0100) "GET Setup.HtML HTTP/1.0" 200  2412
olmstd2110.ucr.edu. - - (15/Sep/1995:18:42:19 +0100) "GET Setup.HtML HTTP/1.0" 200  2412
olmstd2110.ucr.edu. - - (15/Sep/1995:18:44:04 +0100) "GET / HTTP/1.0"  200 3026
staff25.paragon.co.uk. - - (15/Sep/1995:19:07:26 +0100) "GET Setup.HtML HTTP/1.0"  200  2412
staff25.paragon.co.uk. - - (15/Sep/1995:19:08:19 +0100) "GET / HTTP/1.0"  200 3026
staff25.paragon.co.uk. - - (15/Sep/1995:19:08:21 +0100) "GET /httpd.gif HTTP/1.0"  200  1457 - - (15/Sep/1995:19:08:53 +0100) "GET Setup.HtML HTTP/1.0"  200 2412 - - (15/Sep/1995:19:10:59 +0100) "GET Setup.HtML HTTP/1.0"  200 2412
warthog.usc.edu. - - (15/Sep/1995:19:53:00 +0100) "GET / "  200  3026 - - (15/Sep/1995:20:10:37 +0100) "GET Setup.HtML HTTP/1.0"  200 2412 - - (15/Sep/1995:20:11:57 +0100) "GET Setup.HtML HTTP/1.0"  200 2412
00302raw.rawsen.rpslmc.edu. - - (15/Sep/1995:20:31:50 +0100) "GET /setup.html HTTP/1.0"  200  2412
00302raw.rawsen.rpslmc.edu. - - (15/Sep/1995:20:39:41 +0100) "GET /setup.html HTTP/1.0"  200  2412
00302raw.rawsen.rpslmc.edu. - - (15/Sep/1995:20:40:03 +0100) "GET /setup.html HTTP/1.0"  200  2412
warthog.usc.edu. - - (15/Sep/1995:20:52:37 +0100) "GET / "  200  3026
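
As a minimal sketch of how a log like the one above could be read programmatically, the following Python fragment parses records in the layout shown in this appendix (host, two unused fields, a parenthesized timestamp, the quoted request, a status code, and the bytes sent) and counts hits per host. The names LOG_PATTERN and parse_line are hypothetical, chosen only for illustration, and the pattern is an assumption inferred from the sample lines rather than a documented format. Lines in which several records have run together are matched only up to their first record, and the month abbreviations are assumed to be English.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed record layout, inferred from the sample lines in Appendix A:
#   host - - (timestamp) "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \((?P<time>[^)]+)\) '
    r'"(?P<request>[^"]*)"\s+(?P<status>\d{3})\s+(?P<size>\d+)'
)


def parse_line(line):
    """Return a dict for one log record, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    # The request is "METHOD path [protocol]"; the protocol may be absent.
    parts = match.group("request").split()
    return {
        "host": match.group("host").rstrip("."),
        "time": datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "method": parts[0] if parts else "",
        "path": parts[1] if len(parts) > 1 else "",
        "status": int(match.group("status")),
        "size": int(match.group("size")),
    }


# Three records copied from the log above.
sample = [
    'tavish.te.rl.ac.uk. - - (15/Sep/1995:13:20:02 +0100) "GET /quit.sit.hqx HTTP/1.0"  200  2941',
    'gatekeeper.spotimage.com. - - (15/Sep/1995:13:46:25 +0100) "GET / HTTP/1.0" 200  3026',
    'gatekeeper.spotimage.com. - - (15/Sep/1995:13:46:29 +0100) "GET /httpd.gif HTTP/1.0"  200  1457',
]

records = [r for r in (parse_line(line) for line in sample) if r is not None]

# A "hit" here is any single request, whether for a page or an embedded graphic.
hits_per_host = Counter(r["host"] for r in records)
total_bytes = sum(r["size"] for r in records)

for host, hits in hits_per_host.most_common():
    print(host, hits)
```

In this sample the two requests from gatekeeper.spotimage.com are one page and the graphic embedded in it, so a per-host hit count overstates the number of pages viewed.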



[1] Most Web search engines support this feature by using a "link:" attribute followed by the URL of the initial Web page.

[2] URL - Uniform Resource Locator, the "address" of the internet resource.

[3] This state seems analogous to Csikszentmihalyi's concept of flow, where a user gets caught up in the information, lost in the act of browsing, and time and external influences become less important. (Csikszentmihalyi, Mihaly. 1991. Literacy and Intrinsic Motivation. In Literacy: An overview by fourteen experts, edited by S. R. Graubard. New York: Cambridge University Press)

[4] Meadow, et al. point out that Condon first published this law, but Zipf's book Human Behavior and the Principle of Least Effort became so popular that it is primarily cited as the law's source.

[5] but not necessarily differentiating the document from others.

[6] In fairness, Lee also notes that using a history tool to expedite commands can influence the use of such tools. If we think of the Zipfian perspective of least effort, we might concede that navigation can then be biased by the ease of returning to previous sites.

[7] Prime examples of this are tools built by Maxum, found at http://www.maxum.com.

[8] An exhaustive list of external network data collection programs is available at http://www.yahoo.com/computers_internet/.

[9] There is some controversy here. Some Web sites consider a hit to be the whole HTML page and any included graphics or other files. Obviously, with the typical Web page including some graphics, this makes the number of actual hits per page greater than 1. This can be misleading in that hit counters display the actual number of requests for data, but people assume it is the actual number of people who have visited the Web site.

[10] Also note that Web clients may have different settings for their local browser cache. Once a graphic is retrieved on the client, it normally is referenced locally until it is either deleted or expires. This means that a user can view a graphics file used on more than one page, but only request it from the Web server once.

[11] Working Draft 960323, March 23, 1996, available at http://www.w3.org. Ironically, the standard setters themselves change their Web structure so much that I can't hope that a URL will lead directly to the draft.

[12] As we know from the overview of Information Seeking, a system that can help to find data must go through a much more complex series of activities. However, my goal as noted later in this paper is to do just that.

[13] I suspect the more successful Collaborative Filtering systems will actually use a combination of the two: personal filtering that is then scaled up and analyzed at the group level and then customized back to each personal point of view.