Re: review of content type rules by IETF/HTTP community from Robert Burns on 2007-08-21 (public-html@w3.org from August 2007)

From: Robert Burns <rob@robburns.com>
Date: Tue, 21 Aug 2007 14:45:30 -0500
To: Leif Halvard Silli <lhs@malform.no>
Cc: Julian Reschke <julian.reschke@gmx.de>, Karl Dubost <karl@w3.org>, Dan Connolly <connolly@w3.org>, "public-html@w3.org WG" <public-html@w3.org>, Sam Ruby <rubys@us.ibm.com>
Message-Id: <A993C1BB-FED5-4838-95BF-E9FD4B73754E@robburns.com>

Hi Leif,

I think I'm seeing what you're saying now. I think you're looking to  
solve a more immediate problem than I was thinking about. I'll  
explain below.

On Aug 21, 2007, at 10:57 AM, Leif Halvard Silli wrote:

>
> 2007-08-21 16:26:19 +0200 Robert Burns:
>> On Aug 21, 2007, at 8:56 AM, Leif Halvard Silli wrote:
>
>>> The main thing that I agree very strongly with Karl in is that  
>>> the  offline and online "gap" should be bridged, and that this  
>>> can  happen through setting up clear/strict recommendations for  
>>> which  extensions to use - which all sides (authors, authoring  
>>> software,  browsers, servers) should pay attention to. This  
>>> bridging should  include official language and charset  
>>> extensions, taking example from Apache, which also already  
>>> offers its own such extensions, and has done so for a very long  
>>> time already.
>
>> For character encodings I think things are somewhat a mess. Most   
>> authors are not that aware of character encodings. To me it's  
>> really the type of thing authors should not have to worry about  
>> (if it had been handled in a sane way from the start).
>
> Who said that I thought the authors should worry about them -  
> any more than he cares if his application uses .htm, .html or  
> anything else?  Or any more than he cares about how the META tag for  
> encoding specification is written (which, btw, is very hard to  
> remember how to write)?

I was more thinking that authors should never have to think about  
encodings at all (instead being handled by text processing  
applications), though that may be a solution in a different parallel  
universe. In other words if the first one or two bytes of every  
text file included a registered byte sequence that mapped to a  
specific character encoding, then every text editor would simply set  
those bytes according to which encoding it was serializing the file  
to. It's just the kind of thing that no one would need to think about  
(other than text processing application developers). The one  
exception is if a server wanted to make multiple encodings of the  
same document available. However, I don't think this is very common  
even though it's fairly easy to set up on Apache. Instead we just have  
a few encodings that everyone supports (and UAs are expected to  
support many encodings too, which reduces the need for negotiation).

Of course we don't live in that universe and it's difficult for me to  
imagine how to get there. So much legacy software would not know what  
to do with those leading bytes. Users would see the leading bytes as  
control characters and delete them. And so on. However, the UTF  
encodings do get us very close to that with their byte order marks  
(BOMs). I would love to see us simply recommend (as in SHOULD) UTF-8  
or UTF-16 (with authoritative BOMs) for all HTML5 documents.
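
A minimal sketch of how a text-processing application could treat a BOM as the authoritative encoding marker described above. The function names and the `latin-1` fallback are assumptions made for illustration, not anything proposed in this thread or in the HTML5 draft; only UTF-8 and UTF-16 are checked, matching the encodings named above.

```python
import codecs

# Byte order marks for the encodings discussed above (UTF-8 and UTF-16).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else (None, 0)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "latin-1") -> str:
    """Decode using the BOM when present; otherwise use the assumed fallback."""
    encoding, skip = sniff_bom(data)
    if encoding is None:
        # No BOM: the fallback is a guess, which is exactly the problem
        # an authoritative leading byte sequence would avoid.
        return data.decode(fallback)
    return data[skip:].decode(encoding)
```

With a BOM present the author never types an encoding name; without one, the application is back to guessing.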

> The author should not be needing to care whether his or her  
> authoring application adds the charset extension or if it adds a  
> META element with charset information - or do it some other way.

Well many authors (of all skill sets) use text editors and interact  
with the serialization much more closely than that. The charset value  
is unlike any other text an author composes in an HTML document.  
Changing the value of an attribute or typing a different element  
changes the meaning of a document. However, typing a different value  
for the charset does not change the meaning of the document; it just  
makes it wrong. File types are a bit different than that. I can  
change a filename extension from .html to .xhtml and it's not  
necessarily wrong (it's just that I want a different processing of  
the document), though at other times it is just wrong (from .jpeg  
to .html). Again, encodings are very different in that the author  
changes something that the application should probably handle more  
opaquely (like through an invisible byte sequence).

> However, the sad thing is that **if** the author and his  
> application uses a charset extension, then, in an offline milieu, the  
> browsers are likely to not make any sense of the charset extension.

This is where I think you're onto something. It would be good for  
browsers to respect those extensions when opening local files.  
Especially since we don't have the parallel universe approach I'd  
like. However, it's not clear the filename extension would be all  
that much faster than looking inside the file for the internal metadata.
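
As a rough illustration of what "respecting those extensions" for a local file could involve, here is a sketch that splits a chained filename into type, charset and language. The lookup tables and the function name are hypothetical; a real UA or server would take such mappings from its configuration rather than hard-coding them.

```python
# Hypothetical lookup tables, for illustration only.
TYPES = {"html": "text/html", "xhtml": "application/xhtml+xml", "txt": "text/plain"}
CHARSETS = {"utf8": "utf-8", "latin1": "iso-8859-1", "latin2": "iso-8859-2"}
LANGUAGES = {"en": "en", "ru": "ru", "no": "no"}


def parse_extensions(filename: str) -> dict:
    """Split a name such as 'file.html.utf8.ru' into base name and metadata.

    Extensions are matched independently of their order; anything that is
    not recognized is collected under 'other' instead of being guessed at.
    """
    base, *exts = filename.split(".")
    meta = {"base": base, "type": None, "charset": None, "language": None, "other": []}
    for ext in exts:
        key = ext.lower()
        if key in TYPES:
            meta["type"] = TYPES[key]
        elif key in CHARSETS:
            meta["charset"] = CHARSETS[key]
        elif key in LANGUAGES:
            meta["language"] = LANGUAGES[key]
        else:
            meta["other"].append(ext)
    return meta
```

The lookup itself is cheap, but it only reports what the filename claims; nothing here checks that the bytes inside the file agree.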

>> Finally, for languages, it's useful for servers to have metadata  
>> about language at its disposal to quickly deliver to clients.
>
> These extensions are useful also for authors. It is very practical  
> to discern different variants of the same file/content based upon  
> the file extension. For authors, to have to look into the file is  
> cumbersome.

Any author is free to use extensions in any way they want (as long as  
they don't get in the way of the final filename extension used for  
file typing). There's nothing we need to say in the draft or  
elsewhere to enable that practice, do we?

>> However, I like the way HTML handles that already through the  
>> i18N language  features. Apache can even be configured to sniff  
>> inside the files as  they're added to the server to gather this  
>> data for quick indexing  for later.
>
> The problem which .html and .xhtml reveals is that the servers put  
> more weight on the file extension than what is written inside the  
> file.

See this is where I think there's an important difference. Filename  
extensions are the way we (authors) set the file type for files.  
Certainly we can set it incorrectly, but unlike encodings, there can  
be multiple compatible file type treatments for the same file. In  
other words, the same file can be treated as .xhtml, .html, .txt or  
even as a raw stream of bytes. Setting the type that way is how the  
author indicates how to treat the file. No amount of intelligence can  
sniff inside the file to find out how the *author/user* wants that  
file treated. Once again, it makes no sense to treat an 8859-11 file  
as an 8859-2 file. The file's either one or the other.  It's either  
Russian or English (at least its main language is one or the other;  
though I guess a mixed and balanced document might be switchable).

> Besides, one of the purposes of language extensions is for content  
> negotiation. Well, if Apache can do that without language  
> extensions, then fine, that's an extra feature (which even fewer  
> people know about).

So would this be a recommendation for editing and conversion UAs? In  
other words, they would output filename extensions for encoding and  
language to be content negotiation ready?

>> So all of these pieces of metadata each have their own place I  
>> think.  The
>
> .HTML is also a metadata.

Yes, agreed. And its place is in the file extension. Also, as you  
point out, Apache has popularized the practice of using extensions  
for other purposes too. However, there are two very different  
situations: 1) a UA opening a local file, and 2) a server receiving a  
request for a file from a remote UA. It's easy for the local UA to  
simply check the internal metadata on the file. The server, by  
contrast, gets a request for a file and then needs to quickly find  
the right file to deliver to the remote client. In the first case the  
UA is just opening a file. In the second case the server has to find  
the right file. That to me is the reason for having the metadata  
handy (either in chained filename extensions, in SQL databases, or in  
filesystem attributes or extended attributes).

>> safest thing is to keep the authoritative data inside the file   
>> itself, and then extract it and index it in filesystem metadata  
>> or  elsewhere for quick retrieval. Many filesystems (and WebDAV  
>> too)  support extended filesystem
>
> That extraction process is not the simple solution that Karl asked  
> for. I want to save the file and test immediately. And not wait for  
> Spotlight or a big fast computer.

Again, the local situation is different enough from the server, that  
it's really handled by the internal metadata. Or are you just saying  
you want editing UAs to assist authors in outputting this extracted  
metadata as filename extensions?

> Besides, even Mac OS X comes with Apache. And the reason why I, on  
> <MyOwnMac.local> get Apache's default index.html page in Norwegian  
> instead of English, is precisely because the installed version of  
> Apache has implemented filename extension based language negotiation.

So are you suggesting that the filesystems and file browsers change  
so that all files of the same content with different languages get  
presented as a single file? Then 'file.html' would really point to  
two files: 'file.html.utf8.ru' and 'file.html.utf8.en'. So when  
double clicking on 'file.html' (or typing 'open file.html' in the  
terminal), the filesystem would select the right one based on the  
user's stated language preference? Is that what we're talking about?  
Would users have their own encoding preferences too? (Again, this  
seems like something most users wouldn't care about.) That's why I  
was saying that each kind of metadata has its place. They each have  
different relations to the file, the user preferences and the  
operating environment. I'm just trying to understand how we would  
leverage what Apache does for local files.
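
To make the question concrete, here is a sketch of the selection step such a filesystem or file browser would need. It assumes, as in the 'file.html.utf8.ru' example above, that the language is the last extension; the function name and that convention are illustrative assumptions, not a description of how Apache or any filesystem actually works.

```python
def pick_variant(requested, candidates, preferred):
    """Pick the variant of `requested` that best matches the user's languages.

    `candidates` is a list of filenames in one directory and `preferred` is
    the user's language list, most wanted first. Returns None if nothing fits.
    """
    variants = {}
    for name in candidates:
        if name == requested:
            # A file with no language extension acts as the default.
            variants.setdefault(None, name)
        elif name.startswith(requested + "."):
            # Assumption: the language is the final extension.
            lang = name.rsplit(".", 1)[1].lower()
            variants.setdefault(lang, name)
    for lang in preferred:
        if lang.lower() in variants:
            return variants[lang.lower()]
    return variants.get(None)
```

For a user whose preferences are Norwegian then English, a directory holding 'file.html.utf8.ru' and 'file.html.utf8.en' would resolve 'file.html' to the English file. Note that the encoding extension plays no part in the choice, which fits the point that users are unlikely to have encoding preferences.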

>> attributes. Some tools have started to  store this information  
>> there. Systems like Apple's Spotlight extract authoritative  
>> metadata from files and store it in a sqlite database  for  
>> indexing (but also makes use of filesystem attributes and   
>> extended attributes alongside the sql). To me those approaches   
>> represent best practice.  Filenames (and their extensions) can be  
>> too  easily and inadvertently changed: losing that metadata. The  
>> best  thing to do is keep it inside the file (with the exception  
>> of file  type which has now had a long tradition of filename  
>> extension mapping).
>
> I am interested in capitalizing on what we already have. And I do  
> not see these file extension problems that you see. Besides, you  
> can put things both inside the file and in the file name. That is  
> very safe, if the content is lost - which can also happen.

I'm still not clear what purpose the filename extension metadata  
would serve. When it's already there (because of an Apache  
installation), it could be used, but how?

Also one of the points I tried to raise in my earlier response  
relates to modern filesystems. We're long past the filename extension  
disputes that flared up on Mac OS over the years. The internet is  
really what brought filename extensions for file typing to Mac OS (it  
was already happening before Mac OS X largely due to the web and the  
internet in general).

However, we live in a very different time. In the early '90s Mac OS  
was really one of the only widely used systems whose filesystem  
supported file type attributes. Today nearly every widely used filesystem  
either already supports or soon will support extended filesystem  
attributes. This means we have better places to store extracted or  
otherwise determined metadata for files. Also, with XML-RPC and the  
like, the transport of metadata can be handled fairly easily. This  
means we can store and transport all sorts of file metadata without  
overloading the filename extension. As Sander hinted, this also means  
that file type settings and other metadata attributes can be  
localized (e.g., storing standard IANA Latin script encoding types  
while presenting them as fully localized language names). However, I  
guess I'm getting way off-topic here.
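
For what it's worth, a small sketch of storing extracted metadata in an extended attribute instead of the filename. The attribute name `user.charset` is a made-up convention for illustration, and `os.setxattr`/`os.getxattr` are only available in Python on Linux, so the helpers simply report failure on other platforms or on filesystems without extended attribute support.

```python
import os

# Hypothetical attribute name; not a registered or standard key.
ATTR = "user.charset"


def store_charset(path: str, charset: str) -> bool:
    """Try to record the charset as an extended attribute; return True on success."""
    if not hasattr(os, "setxattr"):
        return False  # platform without xattr support in the os module
    try:
        os.setxattr(path, ATTR, charset.encode("ascii"))
    except OSError:
        return False  # filesystem refuses or does not support user attributes
    return True


def load_charset(path: str):
    """Return the stored charset, or None if it is missing or unsupported."""
    if not hasattr(os, "getxattr"):
        return None
    try:
        return os.getxattr(path, ATTR).decode("ascii")
    except OSError:
        return None
```

Unlike a filename extension, the attribute survives a rename, though it can still be lost when the file is copied to a filesystem or transport that drops extended attributes.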

Take care,
Rob
Received on Tuesday, 21 August 2007 19:45:48 UTC