Examining the Validity of World-Wide Web Usage Statistics
© 1996 by Stuart J. Whitmore
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License (as of January 2008; see licensing note.)
The following was originally written in 1996 and was later published (in a shorter form) in the 1997 Proceedings of the Western Decision Sciences Institute. While technologies have changed since then, many of the points made in the document are still relevant, and caution should be exercised before attributing meaning to any given statistic.
The Internet has existed for decades, but only within the past few years has it become a hub of commercial activity. With the privatization of the Internet, along with the advent of graphical HyperText Transfer Protocol (HTTP) browsers and the growth of the part of the Internet known as the World-Wide Web, the "cyberspace" previously enjoyed mostly by government and academic institutions witnessed the invasion of the average consumer. Where the average consumer goes, business is sure to follow, and now the World-Wide Web represents a massive, 24-hour shopping mall. While personal, academic, and other uses of the Web persist, it is virtually impossible not to notice the vast commercialization of this medium.
There is a third "party" that is sure to appear wherever business and consumers meet, and that is consumer statistics. The Web is no different, and statistics abound. Even on personal HTML documents ("home pages"), it is not uncommon to see a counter showing how many visitors have seen the page. For a commercial concern on the Web, however, those statistics mean much more than a "see what I did?" feather in the page designer's cap. As is true with any marketing, the wise business must pay heed to how effectively the online dollar is being spent, and a measure of visits ("hits") seems to offer a valid justification for the resources spent maintaining the page.
Several methods exist for determining the behavior of consumers interacting with a company's Web pages. Perhaps the simplest is a mere counter, which adds one to itself every time a particular page is requested by a browser from across the Internet. It is also possible to create user accounts (either for free or associated with a membership fee), and "cookies" (explained in more detail later) may be used to give viewers an "account number" without the viewer even realizing it. Complex programs can be written to provide an interactive page and at the same time closely track the actions of each viewer.
Unfortunately, the words of Benjamin Disraeli (according to Mark Twain) come to mind: "There are three kinds of lies: lies, damned lies, and statistics." If a company writes a check to a Web advertising firm, with assurances backed by hit statistics that the advertisement will be viewed (n) times, how can that number be verified? That question is the focus of this paper, and the apparent answer to that question may be unhappy news to more than a few businesses.
The research behind this paper followed two distinct paths, one of which was mostly a dead end. The first, and less productive, path was an attempt to interview a variety of individuals and companies who use, generate, or otherwise affect usage statistics collection on the Web. With few exceptions, those approached by electronic mail with a request for an interview simply did not respond. Others responded with a message to the effect of "Don't call us, we'll call you." Therefore, this description of the research will focus on the avenue that provided the most information.
The second path was to construct a Web page designed to collect statistics in a number of ways, with the intent of cross-verifying the numbers provided by each method. The content was intended to generate enough interest that viewers would come back on a regular basis. A brief survey was performed to determine what content would be sufficiently interesting, and based on that survey the page contained links to two features that changed daily. One was a joke, and the other was a photograph drawn from one of seven categories (one for each day of the week). The page also featured a link to a discussion board, where viewers could freely add their comments.
The statistics collection methods used to measure visitor activity were a simple counter, freely given user accounts, and two cookies. The counter was a small program that incremented its count every time the page was loaded. The user accounts were integrated with the HTTP server's document protection scheme to force users to make an account for themselves before they could access the page with the content. Creating an account was a simple and quick process that involved filling out a form online and pressing the "submit" button.
A "cookie" is an HTTP device for storing information on the client side, as written by the Web server. When a document that is designed to serve a "cookie" is loaded by a Web browser that supports cookies, the browser writes the information provided by the server to the cookie file where the browser is installed. The browser then returns that information with later requests to the same server, which is what allows the server to recognize a returning browser.
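The cookie exchange described above can be sketched in Python. This is only an illustration under stated assumptions, not the code used for the experimental page: the cookie name `visitor_id`, the function name, and the in-memory source of ID numbers are all hypothetical. The sketch relies on the standard library's `http.cookies` module to parse the header a browser would send and to format the header a server would return.

```python
from http.cookies import SimpleCookie
from itertools import count
from typing import Optional, Tuple

# Hypothetical source of ID numbers; a real server would keep this in a file.
_next_id = count(1)


def issue_or_read_cookie(request_cookie_header: str) -> Tuple[str, Optional[str]]:
    """Return (visitor ID, Set-Cookie value to send, or None if already set)."""
    jar = SimpleCookie()
    jar.load(request_cookie_header or "")

    # A browser that already holds the cookie sends it back: reuse the number.
    if "visitor_id" in jar:
        return jar["visitor_id"].value, None

    # Otherwise issue a new number and ask the browser to store it.
    visitor_id = str(next(_next_id))
    out = SimpleCookie()
    out["visitor_id"] = visitor_id
    out["visitor_id"]["path"] = "/"
    return visitor_id, out["visitor_id"].OutputString()
```

A browser that does not support cookies never sends the header back, so every one of its requests would be issued a fresh number.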
To a large extent, the results of the interviewing process are not worth mentioning. Some valuable input was received from an individual who read about the project in a Usenet newsgroup, and a spokesperson for the Internet Link Exchange also provided some worthwhile insight. However, most attempts at obtaining an electronic interview met with silence. This may have been a result of sending the request to the wrong person, or the "school project" status may have had a negative effect. There is also a possibility that the issue of the validity of Web statistics was seen as a "taboo" subject.
The results from the experimental Web page were much more enlightening. The raw numbers collected were:
- Counter: 302 visits counted
- Cookie ID: 170 numbers issued
- Link Exchange: 547 visits to the outer page*
- User Accounts: 86 created (approx. 150 total, estimated)**
* 11 of these resulted from a user clicking on the banner for the experimental page when it appeared on another page on the Web.
** About halfway through the project, an error in a program wiped out all user accounts before there was a chance to count them. Estimate is based on casual awareness of the approximate number of accounts created prior to the error.
Clearly, the numbers do not match very well, with the closest match between the cookie ID numbers issued and the estimated number of user accounts created. These results are analyzed in the following section, which also examines the strengths and weaknesses of various methods of statistics collection.
In this section, various methods of collecting statistics are examined, and in the process some potential explanations of the numbers observed from the experimental page are provided.
Generally speaking, there are some strengths to the current methods of tracking the behavior of Web page viewers. It is relatively easy, without programming, to implement a basic counter on a page. This offers even the neophyte writer of Web pages access to some form of statistics. By regularly observing such a counter, a general idea can be ascertained regarding the effectiveness of certain page designs and content. With more in-depth programming for user tracking, businesses can gain more information about their potential customers, and this information can be used to continue improving the page. Using cookies, it is possible to track a single person's visit as they move throughout a site, and this indirect surveillance can give valuable feedback regarding the popularity of the various features on the site.
However, there are a large number of weaknesses in current Web statistics, and to effectively use such statistics the "holes" must be made visible. The weaknesses can be divided into the following categories:
1. Lack of standards
2. Technical weaknesses
3. Viewer behavior
Each of these categories is examined in more depth in the following paragraphs.
Lack of Standards
There are currently no Web-wide standards regarding how statistics are to be collected. This means that one site may count every time a page is requested from a browser as a "hit" while another site may filter requests so that it counts no more than one every five minutes from the same requesting address. Without standard counting methods, each counter programmer can decide what seems appropriate, but that information is not passed along to those who see and use the counter.
In addition, there are no standards for exactly what is being measured. One business may want to know how many times their page is loaded, period. Another may want to know how many individual people see their page, with no interest in how many times a given person sees it. If a Web advertising firm "guarantees" that a page will get (n) hits for a certain price, there is no standard to decipher exactly what that means, and the firm may not volunteer the information.
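To make the ambiguity concrete, the following Python sketch applies three plausible definitions of a "hit" to the same request log. The log entries, addresses, and function names are invented for illustration; the five-minute window is the example filter mentioned above, not a recognized standard.

```python
from datetime import datetime, timedelta

# Hypothetical request log: (requesting address, time of request).
request_log = [
    ("10.0.0.1", datetime(1996, 5, 1, 9, 0)),
    ("10.0.0.1", datetime(1996, 5, 1, 9, 1)),   # reload one minute later
    ("10.0.0.2", datetime(1996, 5, 1, 9, 2)),
    ("10.0.0.1", datetime(1996, 5, 1, 9, 10)),  # return visit after the window
]


def raw_hits(log):
    """Definition 1: every request for the page counts."""
    return len(log)


def filtered_hits(log, window=timedelta(minutes=5)):
    """Definition 2: at most one count per address within the window."""
    last_counted = {}
    total = 0
    for address, when in sorted(log, key=lambda entry: entry[1]):
        previous = last_counted.get(address)
        if previous is None or when - previous >= window:
            total += 1
            last_counted[address] = when
    return total


def distinct_addresses(log):
    """Definition 3: each requesting address counts once, ever."""
    return len({address for address, _ in log})
```

For this log the three functions return 4, 3, and 2 respectively, so a single "hit count" could honestly be reported as any of those figures.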
Technical Weaknesses
Aside from the lack of standards, most problems with Web statistics come about from technical issues related to the environment in which the Web works. The HTTP method of serving Web documents is primarily a Request/Send protocol. A browser requests a document, and (if the document exists) the server sends it. The server cannot preemptively send a document to a browser; the browser must initiate the action in some form. In addition, the server typically responds in one form or another to every request it receives. HTTP servers are designed to answer requests, not to collect reports about documents that were viewed from some other source.
Part of the Web environment, however, is temporary storage locations known as caches and proxies. These storage locations are designed to reduce the amount of data being transmitted, by keeping a local copy. A cache is used by an individual browser to store recently-retrieved documents, and a proxy is generally intended to serve a number of users by keeping popularly-visited sites locally and serving them to local users rather than going across the Web to get the documents. When a browser obtains a document from its individual cache or from a proxy, the original server never knows about it.
This brings about one of the problems with simple counters. A counter program on the HTTP server cannot increment its count by one for a document loaded from a cache or proxy, because it is completely unaware of that load. The server cannot be told about the load, because the cache or proxy is not designed to transmit that information, and even if it did, the server wouldn't know what to do with it. Since a proxy (and a cache on a computer shared by multiple users) will serve a document to multiple people, the counter is understating the actual exposure of the page.
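The undercounting effect can be shown with a toy simulation in Python. The two classes below are simplified stand-ins invented for this sketch; in particular, the proxy here keeps its copy forever, which real proxies and caches do not necessarily do.

```python
class OriginServer:
    """Stand-in for the HTTP server that runs the page counter."""

    def __init__(self):
        self.counter = 0

    def get(self, path):
        self.counter += 1  # the counter only sees requests that reach it
        return f"<html>contents of {path}</html>"


class CachingProxy:
    """Stand-in for a proxy shared by many local users."""

    def __init__(self, origin):
        self.origin = origin
        self.store = {}

    def get(self, path):
        # Only the first request for a document is passed to the origin.
        if path not in self.store:
            self.store[path] = self.origin.get(path)
        return self.store[path]


def simulate(viewers):
    """Return the origin's count after `viewers` people load one page."""
    origin = OriginServer()
    proxy = CachingProxy(origin)
    for _ in range(viewers):
        proxy.get("/index.html")
    return origin.counter
```

Under these assumptions `simulate(25)` returns 1: twenty-five viewers saw the page, but the server's counter recorded a single load.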
There are other problems with counters. A poorly designed counter could potentially increment itself for each discrete item on a single page. For example, if a page uses three graphic images, the faulty counter could count each page load as four (one for the page, three for the graphics on the page). Also, if the counter is simply designed to count each time the page is loaded, a single user can hit their "reload" button several times to increase the count. In this case, the counter is overstating the actual exposure of the page. (Indeed, one viewer of the experimental page for this project admitted to hitting the reload button until the counter was over 100, because it looked too "sad" below 100.)
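The opposite error, overcounting, can be sketched the same way. The file names and the suffix test below are assumptions chosen for the example; they are not taken from the experimental page.

```python
# Hypothetical rule for deciding which requests are pages rather than images.
PAGE_SUFFIXES = (".html", ".htm")


def faulty_count(request_paths):
    """Counts every requested item, including images embedded in the page."""
    return len(request_paths)


def page_only_count(request_paths):
    """Counts only requests for the page document itself."""
    return sum(1 for path in request_paths if path.endswith(PAGE_SUFFIXES))


# One viewer loading one page that uses three graphic images.
one_page_load = ["/index.html", "/logo.gif", "/photo.jpg", "/banner.gif"]

# The same viewer pressing "reload" twice more.
three_loads = one_page_load * 3
```

For `one_page_load` the faulty counter reports 4 and the page-only counter reports 1. For `three_loads` they report 12 and 3, so even the better counter still overstates exposure when a single viewer reloads the page.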
Cookies are defeated by caches and proxies as well, and they have a multitude of other problems. Not all Web browser software supports cookies, so any count based on cookies (such as the experimental page's count of user ID numbers) will understate the actual activity on the page. In addition, cookie files are stored with the browser software, which means that the same person accessing the cookie-serving document from two locations will have two cookies. This overstates the document's activity.
User accounts are not as susceptible to the cache/proxy problem, but they have their own unique set of problems. If an account requires payment, it will certainly give the business an exact count of how many people have purchased accounts – but it will also discourage those who don't want to pay and it will not take into account the potential for users to share accounts. If an account does not require payment (as was the case with the experimental page for this project), users can give themselves multiple accounts, which may be created with fictitious information out of self-protection or malicious intent. On the other side of the coin, users may be reluctant to provide any information, even for a free account. This may be why the Internet Link Exchange count for visits to the "outer" page of the project is so much higher than the count for the "inside" page.
Viewer Behavior
As mentioned in the preceding discussion of technical problems, the people being "watched" may do things to avoid such statistics, intentionally or otherwise. Users concerned about cookies will disable them or use browser software that doesn't support cookies. Viewers asked to provide information may provide false information. Viewers may also share accounts where accounts are needed (particularly if they require payment), or they may simply share a computer and its cache. In addition, bookmarks in browsers can be set to point to pages beyond a poorly-designed account entry system, as can links on other pages. These direct links to sub-documents may be as specific as pointing to an image file, which circumvents any attempt at counting how many times the image is viewed.
World-Wide Web statistics have a lot to offer businesses trying to track their customers' behavior, but only if the statistics are not misleading. To that end, the following recommendations are provided. Following these recommendations will not necessarily result in a "perfect world" for Web statistics, but they are designed to address specific problems and make Web statistics more valuable to businesses.
1. Establish standards!
This consists of setting a standard for what a "page hit" really means, or providing standard terminology to highlight the differences in meanings. This also consists of standardizing the way that counters are incremented.
2. Use CGI!
"CGI" (Common Gateway Interface) is the term used for programs that send Web documents as a result of specific processing. The "normal" Web document is a plain text document with certain codes embedded to specify how the document should be displayed, using the HTML (HyperText Markup Language) format. A text file is not "intelligent" and cannot detect behavior, whereas CGI programs can analyze what is being requested and provide various levels of behavior analysis.
3. Consider HTTP modifications!
This is not a recommendation to make changes to the HTTP standard, but some changes to make statistics more valid (such as proxy/cache communication with servers) should at least be brought up for discussion.
4. Publicize weaknesses!
As long as there are significant discrepancies between true document behavior and the behavior reported by Web statistics, the weaknesses should be communicated to businesses who are using the statistics to make decisions about their online presence. The "truth in advertising" concept should apply; both providers and consumers should be aware of, and willing to discuss, the significant potential for Web statistics to be inaccurate.
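As an illustration of the second recommendation, the following Python sketch shows the general shape of a CGI program that records something about each request before sending the page. It is a sketch only: the function name, the log format, and the page body are invented here. `REMOTE_ADDR`, `HTTP_REFERER`, and `HTTP_USER_AGENT` are environment variables that a CGI-capable server conventionally supplies, though any of them may be absent for a given request.

```python
import os
import sys
from datetime import datetime, timezone


def build_response(environ, log_lines):
    """Record one line about the request, then return the CGI response text."""
    record = "\t".join([
        datetime.now(timezone.utc).isoformat(),
        environ.get("REMOTE_ADDR", "-"),      # address the request came from
        environ.get("HTTP_REFERER", "-"),     # page that linked here, if sent
        environ.get("HTTP_USER_AGENT", "-"),  # browser software, if sent
    ])
    log_lines.append(record)

    # Hypothetical page body; a real program would build the document here.
    body = "<html><body><p>Today's joke goes here.</p></body></html>"
    # A CGI response is a header block, a blank line, then the document.
    return "Content-Type: text/html\r\n\r\n" + body


if __name__ == "__main__":
    visit_log = []
    sys.stdout.write(build_response(os.environ, visit_log))
```

Because the program runs on every request that reaches the server, it can record far more than a bare counter can. It remains blind to any view served from a cache or proxy.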
In this paper, the focus has been finding an answer to the question: "If a company writes a check to a Web advertising firm, with assurances backed by hit statistics that the advertisement will be viewed (n) times, how can that number be verified?" And the answer to that question is, unfortunately, that the number often can't be verified. The reasons for this were discussed, by analyzing several methods of collecting statistics. Specific recommendations were also made to help alleviate the problems, although much of the inaccuracy (and therefore the bulk of the solution) rests with the technical environment of the Web itself.