WWWOFFLE - World Wide Web Offline Explorer - Version 2.8b
=========================================================

One feature that has often been requested in WWWOFFLE is compression,
either of data fetched from servers on the internet or of files stored in
the cache.  Since adding compression of any sort is a big step, I have
implemented it for both of these cases and also for the link from WWWOFFLE
to the client.

The compression options are selectable at compile time and the individual
options are chosen using the WWWOFFLE configuration file.  This means that
if you are not interested in compression, or the compression library is not
available, then the rest of the WWWOFFLE functions are still available.  If
the program is compiled with the compression library then you are not
forced to use it.


zlib
----

The simplest way of adding the compression functionality to WWWOFFLE is by
compiling with zlib.  This provides support for the deflate and gzip
compression methods.

The zlib README file describes the library as:

    zlib 1.1.3 is a general purpose data compression library.  All the
    code is thread safe.  The data format used by the zlib library is
    described by RFCs (Request for Comments) 1950 to 1952 in the files
    ftp://ds.internic.net/rfc/rfc1950.txt (zlib format), rfc1951.txt
    (deflate format) and rfc1952.txt (gzip format).

The zlib library is not GPL software (like WWWOFFLE is), but the copyright
file for it says:

    Copyright (C) 1995-1998 Jean-loup Gailly and Mark Adler

    This software is provided 'as-is', without any express or implied
    warranty.  In no event will the authors be held liable for any
    damages arising from the use of this software.

The zlib library provides different types of functions for the different
compression methods.

For deflate/inflate there are functions that take a block of memory and
compress (deflate) or uncompress (inflate) the contents.  Each block of
memory is treated as part of a larger compressed stream, so the output of
the compression function depends on the previous inputs.  There are also
miscellaneous functions that are used to initialise the compression
functions, to flush out the data at the end and to finish with the
compression functions.

For gzip/gunzip there are functions that can open a compressed file and
read from it (uncompression) or write to it (compression).  All of the
gzip/gunzip compression/uncompression functions operate on files and not
on in-memory data.

The non-availability of in-memory block-by-block gzip/gunzip functions is a
problem since WWWOFFLE needs to be able to compress data as it is flowing
from the server to the client.  A temporary file could be used, but this is
not a good solution in general.  To work around this problem the
gzip/gunzip source code in the zlib library was examined and its operation
(using the deflate and inflate algorithms with some extra fiddling at the
start and end) was implemented as in-memory block-by-block functions.
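As an illustration of what those functions involve, the following sketch
shows how in-memory gzip compression can be built from zlib's deflate
functions.  This is not the actual WWWOFFLE code and the function names are
invented for illustration.  The gzip wrapping amounts to a fixed 10 byte
header, the raw deflated data (RFC 1951) and an 8 byte trailer containing
the CRC32 and the uncompressed length (RFC 1952):

    /* A sketch only, not the code that WWWOFFLE contains.  It assumes
       that the output buffers passed in are large enough. */

    #include <string.h>
    #include <zlib.h>

    static z_stream strm;
    static uLong crc, total_in;

    /* Start a stream; fills in the fixed 10 byte gzip header. */
    int gzip_start(unsigned char header[10])
    {
        static const unsigned char h[10] =
            {0x1f, 0x8b, Z_DEFLATED, 0, 0, 0, 0, 0, 0, 0xff};

        memcpy(header, h, 10);
        crc = crc32(0L, Z_NULL, 0);
        total_in = 0;

        /* windowBits = -MAX_WBITS selects raw deflate data (RFC 1951),
           i.e. without the zlib (RFC 1950) header and trailer. */
        return deflateInit2(&strm, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                            -MAX_WBITS, 8, Z_DEFAULT_STRATEGY);
    }

    /* Compress one block of data; can be called repeatedly.
       On return *n_out is the number of output bytes produced. */
    int gzip_block(unsigned char *in, int n_in,
                   unsigned char *out, int *n_out)
    {
        int err;

        crc = crc32(crc, in, n_in);
        total_in += n_in;

        strm.next_in = in;   strm.avail_in = n_in;
        strm.next_out = out; strm.avail_out = *n_out;

        err = deflate(&strm, Z_NO_FLUSH);
        *n_out -= strm.avail_out;
        return err;
    }

    /* Finish the stream: flush the deflate data, then append the CRC32
       and input length (both little-endian) as the gzip trailer. */
    int gzip_finish(unsigned char *out, int *n_out)
    {
        int i;

        strm.next_in = Z_NULL; strm.avail_in = 0;
        strm.next_out = out;   strm.avail_out = *n_out;

        deflate(&strm, Z_FINISH);

        for (i = 0; i < 4; i++) *strm.next_out++ = (crc      >> (8*i)) & 0xff;
        for (i = 0; i < 4; i++) *strm.next_out++ = (total_in >> (8*i)) & 0xff;
        strm.avail_out -= 8;

        *n_out -= strm.avail_out;
        return deflateEnd(&strm);
    }

Uncompression is the mirror image: skip over the gzip header, inflate with
a negative windowBits value and check the CRC and length in the trailer.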
Compressed Files From Server To WWWOFFLE
----------------------------------------

HTTP/1.1 and Content Negotiation
- - - - - - - - - - - - - - - -

For the compression functions to be worth having on the link from the
server to the client there must be servers that support the function.
Fortunately the HTTP/1.1 standard defines a mechanism by which clients can
indicate to servers that they will accept compressed data.  The servers
reply with compressed data and indicate the method by which the data has
been compressed.

This is a specific instance of the content negotiation functions of
HTTP/1.1.  Unfortunately the definition of HTTP/1.1 and content encoding
leads to ambiguous results.

Theory
- - -

The way that it works (in theory) for gzip compression is the following:

1) The client makes a request with a header of 'Accept-Encoding: gzip';
   this means that it can handle a gzipped version of the URL data.

2) The server looks at the request and supplies a 'Content-Encoding: gzip'
   reply header and a compressed version of the data for the requested URL.

3) The client receives the data, sees the 'Content-Encoding: gzip' header
   and decompresses the data before using it.

An important remark to make at this point is that the HTTP standard defines
the 'Content-Encoding' header to apply to the complete link from server to
client through any proxies.  It is not intended to apply separately to the
server to proxy and proxy to client links.  There is a 'Transfer-Encoding'
header for this, but it is not generally used.

Problem 1
- - - - -

The use of compression is fairly rare and there are problems with clients,
even without the use of WWWOFFLE.

For example Netscape version 4.76 will ask for gzip compressed data and
will display the HTML fine.  The problem is that if the images in the page
are also sent compressed then they are displayed as the 'broken image'
icon.  If you view any single image from the page then it is OK.  This
indicates to me that the browser knows how to handle gzipped data for the
HTML page and for the images, but not for images inside a page!  Mozilla
version M18 works fine with the same page and same images, so it must be a
client problem.

Problem 2
- - - - -

When a request is sent for a URL that is naturally compressed then, even if
no 'Accept-Encoding' header is sent, the data comes back with a
'Content-Encoding' header.  So for example if a user requests the URL
http://www.foo/bar.tar.gz then the data comes back gzipped with a
'Content-Encoding: gzip' header.  If the user saves the file from the
browser then he expects it to be saved to a file called bar.tar.gz and to
contain a compressed tar file.

The problem here is that WWWOFFLE adds an 'Accept-Encoding' header to all
requests, decompresses the replies that come back with a 'Content-Encoding'
header and removes that header.  This breaks what I have described above:
the file bar.tar.gz that the browser writes out is actually a tar file and
not a compressed tar file.  WWWOFFLE has no way to know whether the data
that was requested was compressed in its natural form or whether the
compression was added as part of the content negotiation.

Problem 3
- - - - -

When WWWOFFLE is used and it performs the uncompression for the client then
this can also cause problems.  The Debian Linux package manager program
'apt-get' requests files called Packages.gz from the server.  If WWWOFFLE
uncompresses these and sends them to the client uncompressed then apt-get
fails because the file is not compressed like it expects.

Solution
- - - -

The only solution that I can see is that WWWOFFLE does not decompress any
files that it thinks might be naturally compressed, e.g. *.gz files.  This
means that the configuration file for WWWOFFLE needs to contain a list of
files that it does not request compressed and does not try to decompress.
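As an illustration, a check of this kind might look like the following
sketch.  The function name and the list contents are invented for
illustration; the real WWWOFFLE configuration file has its own syntax for
specifying this.

    /* A sketch of the idea only: decide from the URL whether compression
       should be requested and undone. */

    #include <string.h>

    /* File types that are naturally compressed; requesting these with
       'Accept-Encoding' and decompressing the reply would break things. */
    static const char *dont_compress[] =
        {".gz", ".tgz", ".zip", ".Z", ".bz2", NULL};

    /* Return 1 if the URL looks like naturally compressed data. */
    int is_naturally_compressed(const char *url)
    {
        int i;
        size_t ulen = strlen(url);

        for (i = 0; dont_compress[i]; i++) {
            size_t slen = strlen(dont_compress[i]);

            if (ulen >= slen && !strcmp(url + ulen - slen, dont_compress[i]))
                return 1;
        }

        return 0;
    }

    /* Usage: only negotiate compression for other URLs, and only
       decompress a reply if compression was actually requested, e.g.

       if (!is_naturally_compressed(url))
           add_request_header(request, "Accept-Encoding: gzip");
       (add_request_header is a hypothetical helper.)                 */

The same idea, using a list of MIME types instead of file name patterns,
handles the equivalent problems for the compressed cache described below.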
Problem 4
- - - - -

Due to the browser problems quoted above (Problem 1) there are servers that
will only send compressed content to browsers that they know will accept
it.  This relies on the User-Agent header that the browser sends in the
request.

The problem here is that when people hide the browser that they are using
by changing the User-Agent (either in the browser or using the WWWOFFLE
CensorHeader options) the compression may not be performed.  One server
that does this is www.google.com, which only sends compressed data to
clients that it thinks can handle it.

Solution
- - - -

There are two solutions here: either the user has to choose a fake
User-Agent that will work (but there is no list of those and different
servers may use different ones) or the server needs to be modified.


Compressed Cache
----------------

The problems described above with ambiguity in the meaning of the
'Content-Encoding' header also cause problems with the compressed cache.

Problem 1
- - - - -

If the file is stored in the cache with a 'Content-Encoding' header then
WWWOFFLE would need to decide if it should decompress it before sending it
to the browser.  This needs the same list of files not to compress that is
mentioned in the solution above.

Solution 1
- - - - -

Two solutions to this problem present themselves.

1) Make the 'Content-Encoding' something that WWWOFFLE will recognise as
   being compressed by itself.  For example 'Content-Encoding:
   wwwoffle-deflate' could be used to indicate files that WWWOFFLE
   compressed in the cache and that need to be uncompressed when they are
   read out again.

2) Add another header into the cached file and use a standard content
   encoding.  There is a header called 'Pragma' that can be added to any
   set of HTTP headers; its meaning can be defined by the user, and
   unrecognised values should be ignored.

The first option is the simplest, but leaves the cache files in a
non-standard format.  The second option means that the file itself is still
a valid HTTP header followed by data.  The second option is the one that is
implemented.

Problem 2
- - - - -

Another problem with the compressed cache format is that many files are
already compressed.  For example images will nearly always be compressed
(GIF, JPEG and PNG all include compression).  These files will not benefit
from being compressed again.

Solution 2
- - - - -

As with the solution listed for the server transfer problem, a list of
files not to compress in the cache is needed.  In this case, since the file
already exists in the cache, it is possible to use a list of MIME types not
to be compressed, e.g. image/jpeg.


Compressed Files From WWWOFFLE To Client
----------------------------------------

Now that the problems have been examined for the previous two cases this
case is straightforward.  The list of MIME types that is used for the cache
compression is also used for deciding whether it is worth compressing the
file to send to the browser.


Problems with Compression Formats
---------------------------------

The format of the data sent from servers to WWWOFFLE has caused a variety
of problems.

The format of the data that is normally sent back from servers when deflate
compression is requested is not what is described in RFC 2616, the HTTP/1.1
specification.  The format of the data in this case should have a 2 byte
header and 4 byte trailer (as described in RFC 1950) around the deflated
data (as described in RFC 1951).  The common format that is used is that
the extra header and trailer are not sent, just the deflated data.

If this were the de-facto standard on the internet then it would not be a
problem and WWWOFFLE could request deflated data and not have a problem
reading it.  Unfortunately it is not this simple; there is still the
possibility of receiving the correct zlib formatted data.  There are also
web servers that are even worse because they send back a 10 byte gzip
header followed by the 2 byte zlib header and then the deflated data.

The only solution to this is that WWWOFFLE waits for the first few bytes of
data to be received and then makes a choice about the format based on what
it sees.  This is the approach that is now taken in version 2.8 of
WWWOFFLE: the first 16 bytes of data are accumulated and then a decision is
made.
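As an illustration, the decision is of the following kind.  This is a
sketch only; the details of the actual WWWOFFLE heuristic may differ.

    /* Guess which framing the server actually used for a 'deflate'
       reply by looking at the first bytes of the data. */

    typedef enum {
        FORMAT_GZIP,          /* 10 byte gzip header (RFC 1952) */
        FORMAT_ZLIB,          /* 2 byte zlib header (RFC 1950)  */
        FORMAT_RAW_DEFLATE    /* bare deflate data (RFC 1951)   */
    } deflate_format;

    /* Decide the format from the first bytes (at least 2 are needed). */
    deflate_format guess_format(const unsigned char *data)
    {
        /* gzip data starts with the magic bytes 0x1f 0x8b. */
        if (data[0] == 0x1f && data[1] == 0x8b)
            return FORMAT_GZIP;

        /* A zlib header has CM=8 (deflate) in the low bits of the first
           byte and the two bytes taken together are a multiple of 31. */
        if ((data[0] & 0x0f) == 8 && ((data[0] << 8) | data[1]) % 31 == 0)
            return FORMAT_ZLIB;

        /* Otherwise assume bare deflate data. */
        return FORMAT_RAW_DEFLATE;
    }

Note that the broken servers described above send the gzip magic bytes
first, so seeing a gzip header is not conclusive on its own; accumulating
16 bytes, enough to see past the 10 byte gzip header, presumably allows
that case to be detected as well.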
Andrew M. Bishop
9 December 2003