Re: B.1 and B.2 results from lee@sq.com on 1996-10-19 (w3c-sgml-wg@w3.org from October 1996)

From: <lee@sq.com>
Date: Sat, 19 Oct 96 18:29:59 EDT
To: dgd@cs.bu.edu, w3c-sgml-wg@w3.org
Message-Id: <9610192229.AA13350@sqrex.sq.com>
>    So, I suggest that we have the following problems:
> 
>   STEP 1: determine number of bytes in a character code (bit-combination).

[...]
> 
>    (STEP 1'): Or we can use the FFFE hack, if that will have wider support.

Probably, since it is explicitly suggested by Unicode.  Whether you call
it a hack or not depends on how you feel abut it, I think :-)

>    STEP 2: now we have a standard MIME-header, encoded in Latin1, using the
> byte coding as determined in step 1 (i.e., 1-byte, big-endian, or
> little-endian).

Not yet.  We have looked at the first two bytes of the file and determined
whether they are using the Unicode/10646 EFFF/FFFE option.

We have yet to determine
(1) that the files are in XML at all and not HTML or SGML or GENCODE...
(2) if they are in XML, what version of XML do they use?
    with talk of an XML 2.0, this is important.

> We will define the fields that we _require_ (charset, only,
> I hope). We will ignore, but permit, others. The CRLFCRLF sequence that
> terminates the header will start the XML instance. This header will be
> optional for channels (like HTTP) where the header information can be
> preserved, and required for all others.

OK, I see now.  You are suggesting that we put a MIME header in the
document in all cases.  I think this is an excellent suggestion.

Note that many existing web servers (including Apache) cope with
files containing MIME headers, and may even emit those headers in
response to an HTPP HEAD request.  Apache is said (independently) to
represent over 30% of all running web servers.

> If we go further, we could eliminate STEP 1,
> and parser the header in 8-bit mode always. Then people can use their
> standard codebase to parse the headers, but 2-byte systems might show the
> header as garbage

It's always a little tricky to talk about mixing character sets within
a single file.  However, since MIME headers are in US ASCII (or is
Latin 1 allowed now?), the headers must be in the subset common to both.
As long a you don't include people's names, filenames, document titles
or other displayable information in the MIME header in an XML file,
though, I don't imagine it would be a problem.  James?  Gavin?  Martin?

At a minimum, you would need
    Mime-version: 1.0
    Content-type: text/x-xml;version=1.0;charset=utf-8?

If we get text/xml registered as MIME content type, it can be text/xml
instead of text/x-xml, which would be good (I assume the leading x is OK!).

Instead of requiring the full MIME CR-LF at the end of each line (which
is a pain to mantain on some platforms, e.g. Mac and Unix), I would
suggest documenting a format in which
    * The lines are terminated by either NL or CR
    * Lines longer than 72 characters may be continued by starting
      the next line with spaces or tabs (HT) (as per RFC 822)
    * The header is terminated by
      CR LF CR LF
      CR CR
      or
      LF LF
      i.e. you must have two of the same kind.
You then get a header format which can easily and reliably be edited
on multiple platforms -- e.g. you can upload a file from your PC to
a Unix, NT or Mac server, and make a quick change in Notepad, Sam, or
whatever, without trashing the file header.  Of course, the editor
has to be able to write out the body of the file correctly!

(note that ftp in text mode will silently transform line endings into the
format preferred on the destination platform, RS/RE rules notwithstanding;
when you receive a file of format text/plain (e.g. 'cos the web server
isn't set up to recognise some content type or other), the same
transformation is made, which is why images don't transfer correctly if
they have odd filenmes or are in new formats the server doesn't grok;
it would be useful if XML were robust against such things as much as
possible)

Lee
Received on Saturday, 19 October 1996 18:31:23 UTC