Re: B.1 and B.2 results from David G. Durand on 1996-10-21 (w3c-sgml-wg@w3.org from October 1996)

From: David G. Durand <dgd@cs.bu.edu>
Date: Mon, 21 Oct 1996 18:33:48 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <v02130502ae915f97546a@[128.148.157.46]>
At 12:05 PM 10/21/96, Gavin Nicol wrote:
>>OK, I see now.  You are suggesting that we put a MIME header in the
>>document in all cases.  I think this is an excellent suggestion.
>
>.... this is *precisely* what my *.mim file format (suggested to
>HTML-WG and also out in an expired RFC) *is*.

Well, my suggestion is that we put a MIME header in the file only when we
can't transmit the MIME header information over the channel. We don't want
to have to send two headers to be XML-conformant when going over HTTP.

>>Note that many existing web servers (including Apache) cope with
>>files containing MIME headers, and may even emit those headers in
>>response to an HTTP HEAD request.  Apache is said (independently) to
>>represent over 30% of all running web servers.
>
>Right, but the *.mim file format is different to Apache (or at least
>the last version I looked at) in that Apache sends the file *verbatim*
>and does not necessarily add missing headers... which means that the
>author must understand the entire set of required headers. The
>proposal I put forth only requires headers that will be overriding
>those generated by the server.

This would be essential for XML, as we don't want to force applications to
maintain HTTP-specific information like Content-length, et al.
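
To make the override idea concrete, here is a minimal sketch in Python. It
assumes a *.mim-style file is simply an optional run of MIME headers, a blank
line, and then the document. The function names (split_mim,
effective_headers) and the server defaults are hypothetical, not taken from
Gavin's proposal or from Apache. It reuses the standard library's MIME header
parser, so the same code path serves local files and the network case.

```python
from email.parser import HeaderParser


def split_mim(text):
    """Split a hypothetical *.mim file into (headers, body).

    Header names are lowercased so they can be compared with the
    server's defaults regardless of how the author capitalized them.
    """
    msg = HeaderParser().parsestr(text)
    headers = {name.lower(): value for name, value in msg.items()}
    return headers, msg.get_payload()


def effective_headers(server_defaults, file_headers):
    """Start from what the server generates, then let the file override."""
    merged = {name.lower(): value for name, value in server_defaults.items()}
    merged.update(file_headers)
    return merged


# The author states only what differs from the server's defaults.
sample = "Content-type: text/x-xml;charset=shift-jis\n\n<doc/>\n"
file_headers, body = split_mim(sample)

# The server still computes channel-specific values such as Content-Length.
defaults = {"Content-Type": "text/html", "Content-Length": str(len(body))}
final = effective_headers(defaults, file_headers)
```

The author never writes Content-length; the server fills it in, and only the
Content-type line from the file wins over the default.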

>As I noted before on this list, and also in HTML-WG, most software
>that will be dealing with the WWW will *already* have MIME header
>parsers built into them.... probably as a message stream module, so
>you can *reuse* that code for the local and distributed case.
>
>Again, I seem to be talking to myself.

Well, perhaps to only a few people.

>The headers are in US-ASCII, which is a nuisance if your file is UCS-2
>(your editor would need to have MIME parsing capabilities built in),
>which is a boundary case, but an important one. This is one reason I
>prefer catalog or FSI based solutions. In most practical situations,
>this will not be an overly large concern though.

I think we are better off defining our own convention for "self-identifying
files", as there is none in common use. If a common, robust convention for
metadata is implemented, then systems that implement it are entitled to the
same slack (omission of redundant header) that we should afford HTTP. Given
the facts of life with multibyte encoding, and the desire that files be
maximally self-revealing, we should probably use the character-length
determination hack I suggested, rather than put 8-bit characters at the
front of multibyte files.
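
The details of that hack are not repeated in this message, so the following
Python fragment is only a guess at its general shape. It assumes the document
always begins with a "<" character and infers the character width from where
the zero bytes fall around it. The function name guess_char_width and the
returned labels are made up for illustration.

```python
def guess_char_width(data):
    """Guess the character width of a file that starts with '<'.

    Assumption (not from the original proposal): the first character
    is always '<' (0x3C), so the position of any zero byte next to it
    reveals whether characters are one or two bytes wide.
    """
    head = data[:2]
    if head == b"\x00<":
        return "16-bit big-endian"
    if head == b"<\x00":
        return "16-bit little-endian"
    if head[:1] == b"<":
        return "8-bit or ASCII-compatible"
    return "unknown"
```

Once the width is known, the rest of the header can be read in that width
without any 8-bit prefix on a multibyte file.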

>>At a minimum, you would need
>>    Mime-version: 1.0
>>    Content-type: text/x-xml;version=1.0;charset=utf-8?
>
>In the *.mim file format, the minimum you would need would be CRLF,
>and for non-ISO-8859-1 documents
>
>    Content-type: text/x-xml;charset=shift-jis
>
>>Instead of requiring the full MIME CR-LF at the end of each line (which
>>is a pain to maintain on some platforms, e.g. Mac and Unix), I would
>>suggest documenting a format in which
>...
>
>I would just reference the HTTP specs (though HTTP 1.1 is becoming
>more restrictive), though I could easily be convinced that strict MIME
>compatibility be preserved.

This is a minor issue. Implementations will implement the "all three
conventions" version for a long time, as it's so easy, and implementations
are so bad about line ends generally.
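
For what it's worth, accepting all three conventions really is a one-liner.
A sketch in Python (the name split_header_lines is hypothetical); the CR-LF
alternative is listed first so that a CR-LF pair counts as one line end rather
than two:

```python
import re

# CR-LF must come first, otherwise "\r\n" would match as CR then LF.
_LINE_END = re.compile(r"\r\n|\r|\n")


def split_header_lines(block):
    """Split a header block on CR-LF, bare CR (Mac), or bare LF (Unix)."""
    return _LINE_END.split(block)
```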

>
>The PI hack is a HACK. It is a header hiding under syntax that will
>confuse everyone, or at least cause people to assume that you could do
>something clever like:
>
><?XML-CHARSET SJIS>
>....
><?XML-CHARSET BIG5>
>....
><?XML-CHARSET UTF8>
>
>and we all know *that* is totally bogus.

   Because you can't parse the character set specification, without knowing
what character set to parse in... This is the most infamous of the SGML
declaration's problems with automatic processing: why revisit it on XML
users?
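
The circularity is easy to demonstrate. The Python sketch below uses a naive
byte scanner for the <?XML-CHARSET ...> form from Gavin's example; the scanner
itself (find_charset_pi) is hypothetical and assumes an ASCII-compatible
encoding, which is exactly the assumption that fails for a 16-bit file.

```python
import re

# Looks for the PI as raw ASCII bytes, as a parser that has not yet
# been told the encoding would have to.
_CHARSET_PI = re.compile(rb"<\?XML-CHARSET\s+([A-Za-z0-9_-]+)>")


def find_charset_pi(data):
    """Return the declared charset name, or None if the scan finds nothing."""
    match = _CHARSET_PI.search(data)
    return match.group(1).decode("ascii") if match else None


ascii_doc = b"<?XML-CHARSET SJIS><doc/>"
wide_doc = "<?XML-CHARSET UTF16><doc/>".encode("utf-16-be")
```

The scan finds the declaration in ascii_doc but not in wide_doc, where a zero
byte sits between every pair of characters: the declaration is present, yet
unreadable until you already know what it says.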

   -- David

RE delenda est.
I am not a number. I am an undefined character.
_________________________________________
David Durand              dgd@cs.bu.edu  \  david@dynamicDiagrams.com
Boston University Computer Science        \  Sr. Analyst
http://www.cs.bu.edu/students/grads/dgd/   \  Dynamic Diagrams
--------------------------------------------\  http://dynamicDiagrams.com/
MAPA: mapping for the WWW                    \__________________________
http://www.dynamicdiagrams.com/services_map_main.html
Received on Monday, 21 October 1996 18:34:01 UTC