WWWOFFLE Server Information

This page highlights some of the web-servers that do not follow the published standards for HTTP or URL format. There are many web-pages that deal with the bad HTML that many servers contain; this page is not intended to replicate them, but to concentrate on a different area.

Introduction

There are a number of standards that web-servers and the CGIs on them should follow.
HTML standards
These are published by the World Wide Web Consortium http://www.w3.org/.
HTML 4.01
HTTP standards
These are published by the Internet Engineering Task Force http://www.ietf.org/.
Hypertext Transfer Protocol -- HTTP/1.0 (RFC 1945).
Hypertext Transfer Protocol -- HTTP/1.1 (RFC 2616).
URL standards
These are published by the Internet Engineering Task Force http://www.ietf.org/.
Uniform Resource Locators (URL) (RFC 1738).
Relative Uniform Resource Locators (RFC 1808).

WWWOFFLE needs to work with the web servers that are in use on the internet, not just those that conform to the standards. This causes a number of problems and there are correspondingly a number of places in the WWWOFFLE source code that work around such problems.

Invalid HTTP Replies

The format of the headers is well defined in the HTTP specifications.

Redirection headers

When a browser requests a URL there is often the need for the server to redirect the browser to a different URL. For example when the user requests a URL that is actually a directory and there is no '/' at the end the server will redirect the browser to the same URL with a '/' appended.

The method that is used for this is that the server sends a reply that contains a Location header. The value of this Location header is the new URL that the browser is to load. The HTTP specifications say that this URL must be an absolute URL. This means that it must include the scheme (e.g. http) and the hostname as well as the path.

Despite the very clear statement in the specification there are still servers that insist on sending back invalid Location headers. The usual mistake is that the URL that is included is a path on the server and not an absolute URL.
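
A client that wants to cope with such replies can resolve the header value against the URL that was originally requested. The following is a minimal Python sketch of that idea, not the code used in WWWOFFLE; the function name absolute_location and the example hostname are assumptions made for illustration.

```python
from urllib.parse import urljoin, urlsplit

def absolute_location(request_url, location):
    """Return an absolute URL for the value of a Location header.

    A value that already has a scheme and a hostname is returned unchanged.
    Anything else is treated as a path and resolved against the URL that
    was requested.
    """
    parts = urlsplit(location)
    if parts.scheme and parts.netloc:
        return location
    return urljoin(request_url, location)

# The directory redirect described above, sent (invalidly) as a bare path.
print(absolute_location("http://www.example.com/docs", "/docs/"))
```

With the bare path /docs/ the function returns http://www.example.com/docs/, which is what a conforming server would have sent in the first place.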

Invalid URLs supported or Valid URLs not supported

A feature of WWWOFFLE that can cause problems is in the parsing of URLs.

The URL formats defined in the standards contain methods of converting non-printing characters and reserved characters to character sequences. For example the character @ in a URL must be replaced by %40 (the hexadecimal ASCII code for the @ character is 40). This conversion to hexadecimal form is allowed for all characters and is called url-encoding.

In version 2.6 of WWWOFFLE changes were made to ensure that the URLs that are used in all places conform to the standards. This meant that a number of the URLs that are used needed to be changed (especially where one URL was used as the argument to another one). Also to avoid the same URL being cached twice with different url-encodings WWWOFFLE will convert all URLs that come from browsers into a canonical format.
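
The idea of a canonical format can be illustrated by decoding every url-encoded sequence and then encoding the result again with one fixed rule. This is only a sketch in Python under that assumption; it is not the algorithm that WWWOFFLE uses, and the function name canonical_path is hypothetical. It is also deliberately simplified: it handles the path only, and it does not preserve the difference between an encoded %2F and a literal / in the path.

```python
from urllib.parse import quote, unquote

def canonical_path(path):
    """Collapse different url-encodings of the same path into one spelling."""
    # Decode every %xx sequence, then re-encode with a single fixed rule.
    return quote(unquote(path), safe="/")

# Two spellings of the same path end up as the same cache key.
print(canonical_path("/a%2Db/c d") == canonical_path("/a-b/c%20d"))
```

Both spellings produce /a-b/c%20d, so a cache keyed on the canonical form stores the page once.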

When a Space is Not a Space

One example of a problem with this is a CGI that does not accept all formats of HTML form data that the HTML 4.01 specification says that it should. The url-encoding of characters described above is not the complete story. The space character in form data should be replaced by a + character (HTML 4.01 specification section 17.13.3 & 17.13.4) and not by %20.

The URL in question performs a search based on the form arguments. The URL http://www.xxx.xx.xx/xxxxxxx/xxx.xx/xxxxxx/?aaaaa%20bbbbb%20ccccc as requested by a browser (itself non-conforming to the specification) is converted by WWWOFFLE into the URL http://www.xxx.xx.xx/xxxxxxx/xxx.xx/xxxxxx/?aaaaa+bbbbb+ccccc.

The first of these two formats is accepted by the CGI and returns the expected result; the second form just causes the CGI to say that the search failed.
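
The two encodings of a space, and a decoder that accepts both of them, can be shown with the Python standard library. This is a sketch of what a tolerant CGI could do, not a description of the CGI in question; the name tolerant_decode is hypothetical.

```python
from urllib.parse import quote, quote_plus, unquote_plus

query = "aaaaa bbbbb ccccc"

form_style = quote_plus(query)   # form encoding: space becomes '+'
percent_style = quote(query)     # plain url-encoding: space becomes '%20'

def tolerant_decode(argument):
    """Decode a form argument, accepting both '+' and '%20' for a space."""
    return unquote_plus(argument)

print(form_style)      # aaaaa+bbbbb+ccccc
print(percent_style)   # aaaaa%20bbbbb%20ccccc
print(tolerant_decode(form_style) == tolerant_decode(percent_style))
```

A CGI that decodes its arguments this way gives the same search result for both URLs.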

When a Question Mark Should Not Be Used

Another example is a URL that only works if the characters that should be url-encoded are not encoded. There is a list of the characters that should always be url-encoded; this includes the characters that separate parts of a URL. One obvious one is the ? character that is used to mark the start of HTML form arguments.

A common mistake is not to url-encode the required part of the URL when one URL is used as an argument to a CGI. For example if a CGI takes the URL http://www.foo/bar.cgi?arguments as an argument then the URL for the CGI is often written as http://www.bar/foo.cgi?http://www.foo/bar.cgi?arguments.

This is wrong since the second ? character in the URL should be replaced by %3f to give the URL http://www.bar/foo.cgi?http://www.foo/bar.cgi%3farguments.

This is only a problem when the CGI does not accept a URL in this format. In fact this occurs so often that I have had to change WWWOFFLE not to url-encode a ? character that occurs in an HTML form argument.

The same problem has also been reported for the @ character: a CGI that expects an unencoded @ in a form argument does not work when the character is url-encoded.

For reference the list of characters that should always be url-encoded when used for something other than their reserved meaning is ;, /, ?, :, @, = and &. WWWOFFLE makes an exception for the characters / and : in form arguments and : and = in paths.
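
The encoding of one URL used as the argument to another can be sketched in Python as follows. The function name url_as_argument and its keep parameter are assumptions made for illustration and are not taken from the WWWOFFLE source. By default the sketch leaves / and : alone and encodes every other reserved character; adding ? to keep gives the lenient form that badly written CGIs expect. Python writes the hexadecimal digits in upper case (%3F), which is equivalent to %3f.

```python
from urllib.parse import quote

def url_as_argument(cgi_url, inner_url, keep=":/"):
    """Append inner_url to cgi_url as its single argument.

    Characters listed in keep are left unencoded; every other reserved
    character in inner_url is url-encoded.
    """
    return cgi_url + "?" + quote(inner_url, safe=keep)

outer = "http://www.bar/foo.cgi"
inner = "http://www.foo/bar.cgi?arguments"

print(url_as_argument(outer, inner))              # second '?' becomes %3F
print(url_as_argument(outer, inner, keep=":/?"))  # second '?' left as it is
```

The first call gives http://www.bar/foo.cgi?http://www.foo/bar.cgi%3Farguments and the second gives the unencoded form http://www.bar/foo.cgi?http://www.foo/bar.cgi?arguments.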

Authentication Problems

The HTTP/1.0 and HTTP/1.1 specifications contain a description of a method called "Basic Authentication" that allows a web browser to identify itself to a web server. This can be used to create web pages that are accessible only by allowed groups of users. The authentication works by assigning a username and password to the page and only allowing access to the page when the correct credentials are supplied.

Note: There is no security in this method; it relies on the browser sending the username and password in the equivalent of plain text in the web page request. WWWOFFLE uses this information to cache the page differently for different username/password combinations. Any other web-cache or server that the data travels through can also see the username and password.
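
To show why this is only the equivalent of plain text, here is a short Python sketch that builds the value of the Authorization header and then recovers the username and password from it. The function names are hypothetical; the username and password are the well-known example pair from the HTTP authentication specification.

```python
import base64

def basic_authorization(username, password):
    """Build the value of an Authorization header for Basic Authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return "Basic " + token.decode("ascii")

def decode_basic(header_value):
    """Recover (username, password) from a Basic Authorization header value."""
    scheme, token = header_value.split(" ", 1)
    decoded = base64.b64decode(token).decode("utf-8")
    username, password = decoded.split(":", 1)
    return username, password

header = basic_authorization("Aladdin", "open sesame")
print(header)                 # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
print(decode_basic(header))   # ('Aladdin', 'open sesame')
```

The encoding is base64, which is not encryption: anything that sees the header can reverse it without knowing a key.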

The sequence of events that happen when first requesting a page using this authentication is:

  1. The browser requests the URL without a username or password.
  2. The server sends back a WWW-Authenticate header containing the authentication method (Basic) and a realm (an indicator of which region of the server space the password applies to).
  3. The browser asks the user for the username and password to use for this realm.
  4. The browser sends a request to the server containing an Authorization header containing the username and password.
  5. The server sends the secured web-page if the username and password are correct. If they are incorrect then the server goes back to step 2.
If at step 3 the user presses cancel in the browser pop-up window (the usual method of asking for the username and password) then the page that was returned in step 2 is displayed. Normally this page contains information that indicates that a secured page was requested.
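
The five steps can be sketched as a small Python simulation in which the server and the browser are ordinary functions. Nothing here is taken from a real server or browser; the realm name, the credentials and the function names are all assumptions made for illustration.

```python
import base64

REALM = "Members"               # hypothetical realm name
USERS = {"alice": "secret"}     # hypothetical username and password

def server(path, authorization=None):
    """Return (status, headers, body) for a page protected by Basic Authentication."""
    if authorization and authorization.startswith("Basic "):
        decoded = base64.b64decode(authorization[6:]).decode("utf-8")
        user, _, password = decoded.partition(":")
        if USERS.get(user) == password:
            # Step 5: the page is sent; note that the realm is not in the reply.
            return 200, {}, "secured page"
    # Step 2: ask for authentication and name the realm.
    return 401, {"WWW-Authenticate": f'Basic realm="{REALM}"'}, "authorization required"

def browser_fetch(path, ask_user):
    """Fetch path, calling ask_user(challenge) whenever the server asks for a password."""
    status, headers, body = server(path)                         # step 1
    while status == 401:
        credentials = ask_user(headers["WWW-Authenticate"])      # step 3
        if credentials is None:
            return status, body       # cancel: show the page from step 2
        token = base64.b64encode(":".join(credentials).encode("utf-8"))
        status, headers, body = server(path, "Basic " + token.decode("ascii"))  # step 4
    return status, body

print(browser_fetch("/private/", lambda challenge: ("alice", "secret")))
print(browser_fetch("/private/", lambda challenge: None))
```

The first call ends with the secured page; the second models the user pressing cancel and ends with the page that accompanied the 401 reply.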

The realm that is used in steps 2 and 3 is just a string that is defined by the server to indicate which authentication is required. It is possible to have more than one realm on the same server. The same realm value on different servers is treated as a different realm. There is no restriction in the HTTP specifications that says which pages must belong to a realm. For example in a directory that has authentication with one realm sub-directories may not have any authentication or they may have a different realm. The realm is not returned from the server in step 5.

The concept of the realm is where people become confused. Browser writers often assume that all pages in the same directory (or a subdirectory) have the same realm. They may even assume that all pages on the same server are in the same realm. This means that when requesting a page after having visited an authenticated page the browser tries to save time by just jumping to step 4 in the list above. When this works it is OK, but since the realm definition doesn't allow this shortcut to always work it can cause problems. The browser can't know that the username and password for the realm are required since the realm is not returned after authentication (step 5).

WWWOFFLE will always follow all five steps listed. This is not just because it is the right way to do it, but for a practical reason. When WWWOFFLE stores the pages that have been authenticated it puts the username and password into the URL to decide the filename to use. It does not show this secret filename to the user in any of the indexes. WWWOFFLE needs to ensure that when a user goes back to a URL that has been cached with authentication the page is still available. If the browser has been stopped and restarted then the username and password for the realm are not known by the browser. By storing the result of step 2 in the list WWWOFFLE can trigger the browser to follow steps 3, 4 & 5. If only the result of step 5 is stored then there is no way to prompt the browser to ask the user for the username and password since the realm is not known. A further advantage is that if no authentication is required then the page is cached for all users.
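
The reason for keeping both replies can be illustrated with a very small Python sketch. A dictionary stands in for the cache on disk and a (URL, credentials) pair stands in for the filename; this is an assumption made for illustration and is not how WWWOFFLE actually names its files. The URL, realm and credentials are hypothetical.

```python
cache = {}   # stand-in for the cache on disk, keyed by (URL, credentials)

def store(url, credentials, reply):
    """Keep a reply under the URL and the credentials that were sent with the request."""
    cache[(url, credentials)] = reply

def lookup(url, credentials=None):
    """Return the cached reply for this URL and these credentials, or None."""
    return cache.get((url, credentials))

url = "http://www.example.com/private/"

# Result of step 2: stored without credentials, and it carries the realm.
store(url, None, (401, {"WWW-Authenticate": 'Basic realm="Members"'}, "authorization required"))
# Result of step 5: stored under the credentials that unlocked the page.
store(url, ("alice", "secret"), (200, {}, "secured page"))

print(lookup(url))                        # the 401 reply that makes the browser prompt
print(lookup(url, ("alice", "secret")))   # the secured page
```

A browser that has just been restarted sends no credentials, so it receives the stored 401 reply, prompts the user for the realm, and then repeats the request with the credentials that select the secured page.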

Server and Browser Authentication Run-around

The problems described above, together with the shortcut that browsers take, lead to a trick used by some servers. Instead of using the five steps listed above they make up their own 10 step scheme that relies on browsers that take shortcuts (replacing 5 steps with 10 doesn't seem like a shortcut to me):
  1. The browser requests the URL (for example /readme.html) without a username or password.
  2. The server sends a reply with a Location header that tells the browser that the URL has moved. For example it may direct the user to a page called /login.html?page=/readme.html.
  3. The browser requests the page called /login.html?page=/readme.html without a username and password.
  4. The login.html?page=/readme.html page performs the five step authentication scheme shown in the description above.
    1. The browser requests the page without username/password.
    2. The server sends the WWW-Authenticate header.
    3. The browser prompts the user.
    4. The browser requests the page with username/password.
    5. The server sends back the page.
    The last step of this contains a Location header that tells the browser that the URL has moved to /readme.html.
  5. The browser requests the page /readme.html with a username and password (since the browser takes the shortcut of assuming that the same realm applies to both pages).
The problem with this scheme is with browsers that don't try to cut corners. They reach the last step and send a request for a URL without a username and password. They don't know that the same realm applies (they can't due to the definition of a realm). This is therefore the same request as in step 1 and the story repeats. (Except that the browser may remember that /login.html?page=/readme.html requires a username and password, thereby cutting a 10 step infinite loop to a 7 step one.) WWWOFFLE acts this way. But since all of the pages are already cached it can let you execute the infinite loop much faster.
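
The difference between the two kinds of browser can be seen in a small Python simulation of such a server. It is a sketch under simplifying assumptions: the server only checks whether credentials were sent at all, the prompt for the password is not modelled, and the function names and the limit of 20 requests are hypothetical.

```python
LOGIN = "/login.html?page=/readme.html"
PAGE = "/readme.html"

def server(url, has_credentials):
    """Return (status, location) for the redirect-based login scheme."""
    if url == PAGE:
        return (200, None) if has_credentials else (302, LOGIN)
    if url == LOGIN:
        return (302, PAGE) if has_credentials else (401, None)
    return (404, None)

def fetch(url, shortcut, max_requests=20):
    """Follow replies until a page arrives; return (status, number of requests).

    With shortcut=True the client sends credentials to every URL once any
    URL has asked for them. With shortcut=False it sends them only to URLs
    that have themselves replied with 401.
    """
    authenticated = set()   # URLs that have replied 401 to this client
    requests = 0
    while requests < max_requests:
        send = url in authenticated or (shortcut and bool(authenticated))
        status, location = server(url, send)
        requests += 1
        if status == 401:
            authenticated.add(url)   # user is prompted, same URL is retried
        elif status == 302:
            url = location
        else:
            return status, requests
    return None, requests            # gave up: stuck in the redirect loop

print(fetch(PAGE, shortcut=True))    # (200, 4)
print(fetch(PAGE, shortcut=False))   # (None, 20)
```

The client that takes the shortcut gets the page after four requests. The strict client is sent back and forth between the two URLs until it gives up, which is the loop described above.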

The work-around for this (not really a solution) is to set the try-without-password=no option in the OnlineOptions section of the WWWOFFLE configuration file. This will mean that WWWOFFLE no longer tries to request the page without a username and password first. It will request only the page that the browser requests. This will lead to users being unable to access /readme.html when offline (after a browser reset) since only the version with a username and password is available.
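
As an illustration, the entry might look like the fragment below. The option and section names are the ones given above; the layout with braces is my assumption about the format of the configuration file, so check it against the configuration file documentation for the installed version before copying it.

```
OnlineOptions
{
 try-without-password = no
}
```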