
Re: Web archive processing

From: Terry Teague <terry_teague@users.sourceforge.net>
Date: Tue, 15 Jan 2002 00:31:31 -0800
Message-Id: <l03130300b869938d27f2@[17.219.108.52]>
To: html-tidy@w3.org

At 11:01 AM +1300 1/15/02, Richard A. O'Keefe wrote:
>I'm teaching a small group of students a very short course on web usability.
>I have them using ICab (so that they can see ICab frown and read its error
>log), and have got them to save a frowny page so that they can drop it into
>MacTidy.  This is fine, and tells them a lot they need to know, but one
>problem came up.
>
>The "File|Save As" menu choice in ICab offers three options:
>* Plain text
>* HTML (without images)
>* Web Archive (with images)
>A couple of students made the obvious choice that they didn't want to
>throw the images away, so they chose "Web Archive" as the output form.
>MacTidy apparently doesn't know about Web Archive form, whatever that is,
>and as far as I can tell neither does the UNIX source version of HTML Tidy.
>
>Could someone
>- tell me where I can find out about Web Archive form so that I can teach
>  Tidy about it myself (as if I had lots of free time)
>- or implement this feature (as if anyone else has lots of free time either).

Of course you are quite as capable as I am at using a search engine <grin>.

Depending on where you look you may get slightly different answers. I don't
claim to have any expertise in the areas of Web Archives or iCab.

As I understand it, the Web Archive format was introduced with Internet
Explorer 4.0.x, and is essentially a standard Java ".jar" file with the
extension changed to ".war". Apparently there were some bugs with early
IE 4.0 versions and Web Archives, and based on the comments below, it is
possible that Microsoft "embraced and extended" the jar format.

The jar format is itself based on the well-known "zip" format (a
registered trademark of PKWARE, Inc. Incidentally, the inventor of zip,
Phil Katz, passed away in April 2000).
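
If a saved web archive really is zip-compatible, that can be checked
without trusting the file extension. Here is a minimal sketch in Python,
using only the standard "zipfile" module. The function name is made up
for illustration, and I have not tried it on real IE or iCab archives:

```python
import zipfile


def list_zip_members(path):
    """Return the member names if `path` is a ZIP-format file, else None.

    The check looks at the file's structure, not its extension, so a
    ".jar" or ".war" file that is really a zip archive is accepted.
    """
    if not zipfile.is_zipfile(path):
        return None
    with zipfile.ZipFile(path) as archive:
        return archive.namelist()
```

A result of None would suggest the file uses some other (possibly
"extended") format and needs converting first.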

Here is what the authors of iCab had to say about their Web Archive format:

"The MS Internet Explorer is able to save whole web sites in a single file
called web archive. Unfortunately Microsoft introduced a new file format
for these archives which can only be loaded by the Internet Explorer
itself. So accessing certain files from the archive is very difficult and
complicated.

The program "WAC" converts the IE web archive into a ZIP archive. The ZIP
archive can be unpacked with StuffIt or ZipIt on the Mac, but it can also
be unpacked on DOS/Windows using PKZip, on Unix using UnZip, and on any
other computer platform. The ZIP archive can be directly loaded by iCab because
iCab (since Preview 1.7) uses the ZIP archive format as its own web archive
format.

To start converting a web archive you should drag the archive file onto the
WAC program icon.  You can drag any number of IE web archives to the WAC
program at once."

The Web Archive Converter (for Mac OS) can be downloaded from:

<http://www.icab.de/icab/soft/WAC.sit>


My opinion is that we shouldn't modify Tidy to handle either ".zip" or
".jar" files directly, but should just use existing external tools to
unarchive the files, then process them with Tidy.
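
To make the "unarchive first, then run Tidy" idea concrete, here is a
rough sketch in Python. It assumes the archive is an ordinary zip file and
that a command-line "tidy" is on the PATH. The function names and the
choice of the "-q -e" (quiet, errors only) options are mine, not anything
built into Tidy or MacTidy:

```python
import subprocess
import zipfile
from pathlib import Path

HTML_SUFFIXES = {".html", ".htm"}


def extract_html(archive_path, dest_dir):
    """Unpack a ZIP-format archive into dest_dir; return its HTML files."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest)
    # Images and other resources are unpacked too, but only HTML is returned.
    return sorted(
        p for p in dest.rglob("*")
        if p.is_file() and p.suffix.lower() in HTML_SUFFIXES
    )


def check_with_tidy(html_files, tidy_cmd="tidy"):
    """Run Tidy in errors-only mode on each file; return {path: report}."""
    reports = {}
    for path in html_files:
        result = subprocess.run(
            [tidy_cmd, "-q", "-e", str(path)],
            capture_output=True,
            text=True,
        )
        # Tidy writes its warnings and errors to standard error.
        reports[path] = result.stderr
    return reports
```

Something like check_with_tidy(extract_html("page.zip", "unpacked")) would
then give one error report per HTML file in the archive, with Tidy itself
left unchanged.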

If you feel we need modifications to Tidy or MacTidy, to handle web
archives, let's talk...

Regards, Terry
Received on Tuesday, 15 January 2002 03:33:13 GMT
