Re: review of content type rules by IETF/HTTP community from Robert Burns on 2007-08-21 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Tue, 21 Aug 2007 09:26:19 -0500
To: Leif Halvard Silli <lhs@malform.no>
Cc: Julian Reschke <julian.reschke@gmx.de>, Karl Dubost <karl@w3.org>, Dan Connolly <connolly@w3.org>, "public-html@w3.org WG" <public-html@w3.org>, Sam Ruby <rubys@us.ibm.com>
Message-Id: <64F7F71D-BDFD-43B5-9C8D-3867F28A1F2C@robburns.com>
Hi Leif,

On Aug 21, 2007, at 8:56 AM, Leif Halvard Silli wrote:

>
> 2007-08-21 14:21:01 +0200 Julian Reschke <julian.reschke@gmx.de>:
>
>> Leif Halvard Silli wrote:
>>> ...
>>> Later Julian Reschke replied:
>>>> I think they do.
>>>> XHTML: <http://tools.ietf.org/html/rfc3236#section-2>
>>>> Template: <http://tools.ietf.org/html/rfc4288#section-4.11>
>>> One of Karl points was probably that one actually recommend  
>>> several extensions for (in this case) XHTML. By recommending  
>>> only .XHTML, XHTML-files would in most cases automatically be  
>>> served as 'application/xhtml+xml', and thus authors/users would  
>>> experience the effects of XHTML.
>> RFC3236 mentions XHTML, XHT and HTML.
>
> Like I said.
>
>> Apache 2.2.x comes with a preconfigured mapping file (mime.types)  
>> which has
>> 	application/xhtml+xml           xhtml xht
>> so as far as I can tell, it already does what you're looking for  
>> (and probably has for a long time).
>
> I am aware of this. And although there are more web servers than
> Apache, and more browsers than Firefox, this might serve (sic) as
> an example. (By asking Ian for examples of files.XHTML being served
> as text/html, I suspect he expects to hear that there are very
> _few_ such examples. In contrast, Ian has often been keen to
> demonstrate that things don't work, e.g. showing how images
> served as text will still be treated as images by
> browsers ... and other such things.)
>
> The main thing that I agree very strongly with Karl on is that the
> offline and online "gap" should be bridged, and that this can
> happen through setting up clear/strict recommendations for which
> extensions to use - which all sides (authors, authoring software,
> browsers, servers) should pay attention to. This bridging should
> include official language and charset extensions, taking example
> from Apache, which already offers its own such extensions, and
> has done so for a very long time.

I'm not so sure I would characterize this as a problem between the
online and offline worlds. Mappings of filename extensions to MIME
types are already quite common in both worlds. The problem arises
with misconfigured servers, or with servers that have not been
configured for new MIME types and new file extensions. As I
understand it, it also comes from servers sending a default MIME type
for files they're not sure about (instead of just admitting they
don't know).
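
As a rough illustration of the "admit it doesn't know" behaviour, here
is a minimal Python sketch. It assumes the standard-library mimetypes
module and registers the two XHTML extensions from the Apache
mime.types line quoted above; the helper name mime_type_or_none is
hypothetical, and this is not how any particular server implements the
lookup.

```python
import mimetypes

# A private registry, so the sketch does not alter the module-wide tables.
_registry = mimetypes.MimeTypes()

# Mirror the Apache mime.types entry quoted earlier in this thread:
#   application/xhtml+xml    xhtml xht
_registry.add_type("application/xhtml+xml", ".xhtml")
_registry.add_type("application/xhtml+xml", ".xht")


def mime_type_or_none(filename):
    """Return the mapped MIME type, or None when the extension is unknown.

    Hypothetical helper: the caller decides what to do with None instead
    of having a default type silently substituted.
    """
    mime_type, _content_encoding = _registry.guess_type(filename)
    return mime_type
```

A caller that gets None back could omit the Content-Type header
entirely rather than guess.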

For character encodings I think things are somewhat of a mess. Most
authors are not that aware of character encodings. To me it's really
the type of thing authors should not have to worry about (if it had
been handled in a sane way from the start). Adding filename
extensions for encoding could be one approach (as a longtime Mac
user, it doesn't really appeal to me too much, but we did make the
adjustment to filename extensions for file types). However, I think
Unicode has really introduced a better approach with, well, Unicode
itself, and also with the byte order mark, which does a fairly good
job of identifying UTF-8 and UTF-16 files as such. A logical
extension of this (outside our scope) would be some sort of byte
registry for character encodings. Each character encoding could have
its own one- or two-byte sequence that every file in that encoding
started with. Once text editors had been updated to handle these
registered bytes, authors would never have to think about it again.
Every text file would always have its encoding tattooed on its forehead.
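
To make the byte order mark idea concrete, here is a small Python
sketch that checks the first bytes of a file against the BOM constants
in the standard codecs module. The function name sniff_bom is
hypothetical. The UTF-32 entries go beyond the encodings mentioned
above; they are included only because the UTF-32 little-endian BOM
begins with the same two bytes as the UTF-16 little-endian one. The
proposed byte registry for other encodings does not exist, so nothing
here attempts it.

```python
import codecs

# Longest marks first: the UTF-32 LE mark starts with the UTF-16 LE mark,
# so checking UTF-16 first would misreport UTF-32 LE data.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding_name, bom_length) for leading bytes, or (None, 0).

    Hypothetical helper: a missing BOM says nothing about the encoding,
    so the caller still needs some other source of information.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Only the first four bytes of a file are needed, so a caller can read
just that much before deciding how to decode the rest.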

Finally, for languages, it's useful for servers to have metadata about
language at their disposal to quickly deliver to clients. However, I
like the way HTML handles that already through the i18n language
features. Apache can even be configured to sniff inside the files as
they're added to the server to gather this data for quick indexing
for later.
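
As a sketch of what such sniffing might look like, the following
Python reads the lang (or xml:lang) attribute from a document's root
html element using the standard html.parser module. The names
_LangFinder and document_language are hypothetical, and this is not
Apache's mechanism, just an illustration of pulling the language out
of the file itself.

```python
from html.parser import HTMLParser


class _LangFinder(HTMLParser):
    """Remember the language declared on the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag == "html" and self.lang is None:
            declared = dict(attrs)
            self.lang = declared.get("lang") or declared.get("xml:lang")


def document_language(markup):
    """Return the root element's declared language, or None if absent."""
    finder = _LangFinder()
    finder.feed(markup)
    return finder.lang
```

The extracted value could then be cached by the server and sent as a
Content-Language header without reparsing the file on every request.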

So all of these pieces of metadata each have their own place, I think.
The safest thing is to keep the authoritative data inside the file
itself, and then extract it and index it in filesystem metadata or
elsewhere for quick retrieval. Many filesystems (and WebDAV too)
support extended filesystem attributes. Some tools have started to
store this information there. Systems like Apple's Spotlight extract
authoritative metadata from files and store it in a SQLite database
for indexing (but also make use of filesystem attributes and
extended attributes alongside the SQL). To me those approaches
represent best practice. Filenames (and their extensions) can be too
easily and inadvertently changed, losing that metadata. The best
thing to do is keep it inside the file (with the exception of file
type, which has now had a long tradition of filename extension mapping).
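
The extract-then-index pattern can be sketched in a few lines of
Python with the standard sqlite3 module. The table layout and the
names build_index and lookup are hypothetical; this is not Spotlight's
schema, only an illustration of keeping a derived index next to files
that remain the authoritative source.

```python
import sqlite3


def build_index(records):
    """Create an in-memory index from (path, charset, lang) tuples.

    The tuples are assumed to have been extracted from the files
    themselves; the index is only a cache and can be rebuilt at any time.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE metadata ("
        "path TEXT PRIMARY KEY, charset TEXT, lang TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)", records
    )
    conn.commit()
    return conn


def lookup(conn, path):
    """Return (charset, lang) for a path, or None if it is not indexed."""
    row = conn.execute(
        "SELECT charset, lang FROM metadata WHERE path = ?", (path,)
    ).fetchone()
    return row
```

Because the index is derived, renaming a file only makes its row
stale; re-extracting from the file's contents restores the metadata.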

Take care,
Rob
Received on Tuesday, 21 August 2007 14:26:39 UTC