Re: Invasion of the pseudo-people: character encoding in tedious detail

From: Gavin Nicol <gtn@eps.inso.com>
Date: Mon, 9 Jun 1997 14:52:38 -0400
Message-Id: <199706091852.OAA12156@nathaniel.ebt>
To: w3c-sgml-wg@w3.org
>> I can accept the ENCODING parameter on the XML declaration as being of
>> *informative* value, but if you have anything more reliable to use, it
>> should be given priority.
>
>We have been talking about an either/or choice along our decision tree. But there
>is another possibility: the encoding PI and the charset parameter (and locale
>and user preferences) are just a priority list for autodetection. 

I don't like the word "autodetection" in the sentence, and would prefer
"determination". In other words, unless I can *know*, with a
reasonable degree of certainty, what the encoding is, I consider the
system broken.
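
To make the distinction concrete, here is a minimal sketch of "determination" over a priority list, as opposed to guessing from the bytes. The function name, the exception, and the source labels are all hypothetical, and the order of the list is the caller's policy; nothing in this thread settles it. The one property the sketch insists on is that it fails loudly when no source names an encoding.

```python
from typing import Optional, Sequence, Tuple


class EncodingUnknownError(Exception):
    """Raised when no source in the priority list names an encoding."""


def determine_encoding(
    sources: Sequence[Tuple[str, Optional[str]]],
) -> Tuple[str, str]:
    """Return (source label, encoding name) from the first source that has one.

    `sources` is ordered from most to least trusted. Each entry pairs a
    label (for example "MIME charset") with the encoding name that source
    reported, or None if it reported nothing.
    """
    for label, value in sources:
        if value and value.strip():
            # Encoding names are compared case-insensitively, so normalise.
            return label, value.strip().lower()
    # No guessing from the byte stream: an unknown encoding is an error.
    raise EncodingUnknownError("no source declared an encoding")
```

A caller would pass something like `[("MIME charset", None), ("encoding PI", "Shift_JIS")]` and get back the PI's value together with the label of the source that supplied it.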

>> Maybe we should just require that XML *always* be in utf8? (I disagree
>> on a personal level, but from one viewpoint, this has a lot in its
>> favor).
>
>Is that you suggesting this, Gavin?  It would be nice if this were
>possible, for XML 10.0.  

Guilty as charged. As you and I know, there are many reasons why this
is still not possible....

>PSEUDO-GAVIN
>
>1) we need a way for a document's encoding to be known by a server; &
>2) we need a way for a document's encoding to be known by a client.
>
>Number 2) is already handled by MIME.  Number 1) is better handled by
>system dependent methods at the server end, ideally using MIME format.

You got the stance right, without most of the justification,
unfortunately. 

>PSEUDO-MAKOTOSAN
>
>1) there should only be one primary method for a document to describe
>itself; other methods are only in case of failure. PIs are the only way
>to do this.

I think his stance is a bit further afield than that. Seems like they
want all kinds of autodetection in there.

>PSEUDO-RICKO
>
>1) "horses for courses": where there is a reliable system-specific way to
>store, transmit or maintain character encodings, that way is to be preferred, 
>since it will make the document integrate better into that system;
>
>2) where there is no reliable system-specific way to store and maintain
>character encoding, then the PI must be used;

I have no argument against this, and indeed, this is very close to the
start of my thought process.

>This means:
>
>* an http client should prefer MIME to PIs for received XML documents;
>* a UNIX http server must use PIs because its files are undecoratable;
>* a Macintosh http server should prefer PIs rather than charset data in
>the resource fork, because a simple file transfer from another OS will
>maintain the PI, but maybe won't set the resource fork correctly;
>* a stream editor using UNIX pipes should have XML documents with
>PIs;

This is where we diverge. I would like to remove the restrictions on
all these systems, rather than adjust to them.

>PSEUDO-DRACO
>
>Finally, a spectre of Draco appears:
>
>1) If an http client finds a file with a different MIME charset to
>   its PI, then there has been some dumb processing going on, and the
>   file must be regarded as suspect, and therefore killed.
>2) This is really a problem of maintaining and verifying the
>   integrity of data across uncontrolled systems. So XML files are
>   binary, not text.

I think position (1) is reasonable, and this is a reportable error in
XML today.
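
As an illustration of position (1), here is a rough sketch of a client-side check that reports a conflict between the MIME charset and the encoding named in the document itself. All names are hypothetical. It assumes the document begins with an XML declaration written in an ASCII-compatible encoding, so UTF-16 and EBCDIC documents would need a byte-level sniff first, and it compares names case-insensitively only, without resolving aliases such as "utf8" and "utf-8".

```python
import re
from typing import Optional

# Matches the encoding name in an XML declaration at the very start of the
# document. Only works when the declaration itself is ASCII-compatible.
_DECLARATION = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


class EncodingMismatchError(Exception):
    """Raised when the MIME charset and the document's own label disagree."""


def declared_encoding(document: bytes) -> Optional[str]:
    """Return the lower-cased encoding named in the document, or None."""
    match = _DECLARATION.match(document)
    return match.group(1).decode("ascii").lower() if match else None


def check_charset_agreement(
    mime_charset: Optional[str], document: bytes
) -> Optional[str]:
    """Return the agreed encoding, or raise if the two labels conflict."""
    declared = declared_encoding(document)
    mime = mime_charset.strip().lower() if mime_charset else None
    if mime and declared and mime != declared:
        raise EncodingMismatchError(
            f"MIME charset {mime!r} conflicts with declared encoding {declared!r}"
        )
    # If only one label is present there is nothing to conflict with.
    return mime or declared
```

Whether a mismatch should kill the document outright or merely be reported is a policy choice; the sketch only surfaces the conflict and leaves that decision to the caller.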
Received on Monday, 9 June 1997 14:53:22 EDT