Invasion of the pseudo-people: character encoding in tedious detail

> From: Gavin Nicol <gtn@eps.inso.com>

-------------------------------------------------------------------------------------------------------------------
> I can accept the ENCODING parameter on the XML declaration as being of
> *informative* value, but if you have anything more reliable to use, it
> should be given priority.

We have been talking about an either/or choice along our decision tree. But there
is another possibility: the encoding PI and the charset parameter (and locale
and user preferences) are just a priority list for autodetection.  In the usual
case, I'd hope, there would be agreement between the encoding PI and the charset
parameter (since Gavin assures us of the future excellence of servers
in this regard :-).

I don't really like this, because I think we need to be clearer. It is a difficult problem
and it deserves good attention.  
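
For concreteness, the priority-list idea mentioned above can be sketched as
follows. This is only an illustration: the function name, the source labels
and the ordering of the list are all assumptions of mine; nothing here says
which source ought to come first.

```python
def pick_encoding(candidates):
    """Return (source, encoding) for the first source that declares one.

    `candidates` is a list of (source_name, encoding_or_None) pairs,
    highest priority first. The ordering is the caller's decision.
    """
    for source, encoding in candidates:
        if encoding:
            return source, encoding.lower()
    return None, None


# Hypothetical case: no MIME charset was sent, so the PI wins.
choice = pick_encoding([
    ("mime-charset", None),
    ("encoding-pi", "Shift_JIS"),
    ("locale", "EUC-JP"),
    ("user-preference", "UTF-8"),
])
```

Note that a pure priority list never notices disagreement between sources;
it simply takes the first answer it is given.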
 
----------------------------------------------------------------------------------------------------------
> Maybe we should just require that XML *always* be in utf8? (I diasgree
> on a personal level, but from one viewpoint, this has a lot in it's
> favor). 

Is that you suggesting this, Gavin?  It would be nice if this were possible, for XML 10.0.

------------------------------------------------------------------------------------------------------------
Without wishing to be too tedious, there are several different models, each leading
to different results:

PSEUDO-GAVIN

Let me invent a person called Pseudo-Gavin. He sees the need in these
terms:

1) we need a way for a document's encoding to be known by a server; and
2) we need a way for a document's encoding to be known by a client.

Number 2) is already handled by MIME.  Number 1) is better handled by
system-dependent methods at the server end, ideally using MIME format.

PSEUDO-MAKOTOSAN

Let me invent another person called Pseudo-Makotosan. He sees the
need in these terms:

1) there should only be one primary method for a document to describe
itself; other methods are only in case of failure. PIs are the only way
to do this.

PSEUDO-RICKO

Let me introduce Pseudo-Ricko (? Is this what they call "reinventing
yourself" ?)  He thinks:

1) "horses for courses": where there is a reliable system-specific way to
store, transmit or maintain character encodings, that way is to be preferred,
since it will make the document integrate better into that system;

2) where there is no reliable system-specific way to store and maintain
character encoding, then the PI must be used.

This means:

* an http client should prefer MIME to PIs for received XML documents;
* a UNIX http server must use PIs because its files are undecoratable;
* a Macintosh http server should prefer PIs rather than charset data in
the resource fork, because a simple file transfer from another OS will
maintain the PI, but maybe won't set the resource fork correctly;
* a stream editor using UNIX pipes should have XML documents with PIs.
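
Pseudo-Ricko's rule reduces to one question per context: is the
system-specific channel reliable there? A rough sketch, in which the context
names and the table itself are hypothetical labels for the four cases above:

```python
# For each context: is the system-specific encoding label reliable?
SYSTEM_LABEL_IS_RELIABLE = {
    "http-client": True,        # MIME charset arrives with the response
    "unix-http-server": False,  # plain files carry no encoding label
    "mac-http-server": False,   # resource fork may not survive a transfer
    "unix-pipe": False,         # a byte stream has no side channel
}


def preferred_source(context):
    """Return "system" where the system label is reliable, else "pi"."""
    return "system" if SYSTEM_LABEL_IS_RELIABLE[context] else "pi"
```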

PSEUDO-RAVIN

Here is another fiction, Pseudo-Ravin. He thinks:

1) PIs are only reliable if there is smart transcoding (to rewrite the PI);
2) MIME is only reliable if there is smart transcoding (to rewrite the MIME charset);
3) http servers shouldn't invent a character encoding if the PI is available;
4) http clients shouldn't use something else if MIME charset is available;  
5) unthinking transcoding without altering the MIME or the PI
will always stuff things up: the issue for us is not "how to prevent stuff-ups" but
"how to allow reliability"; and
6) an http server should rewrite the charset pseudo-attribute if it transcodes the
file; so should an intermediate proxy.
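
What "smart transcoding" might look like, as a minimal sketch: the transcoder
changes the bytes and both labels in one step. It assumes the in-document
label takes the `<?xml version="1.0" encoding="..."?>` form; the function name
and return shape are my own invention, and a real server would also have to
cope with byte-order marks and with documents that carry no label at all.

```python
import re

# Matches the encoding pseudo-attribute of the first XML declaration.
_ENCODING_DECL = re.compile(r'(<\?xml\s[^>]*?encoding=)(["\'])([^"\']*)\2')


def transcode(data, src, dst):
    """Re-encode `data` from `src` to `dst` and relabel it.

    Returns (new_bytes, charset), where `charset` is the value the
    caller should put in the MIME charset parameter.
    """
    text = data.decode(src)
    # Rewrite the in-document label so it matches the new byte encoding.
    text = _ENCODING_DECL.sub(
        lambda m: m.group(1) + m.group(2) + dst + m.group(2),
        text,
        count=1,
    )
    return text.encode(dst), dst


original = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode("iso-8859-1")
body, charset = transcode(original, "iso-8859-1", "utf-8")
```

After the call, the declaration inside `body` and the returned `charset` both
say `utf-8`, so neither label has been left behind by the transcoding.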
 

PSEUDO-DRACO

Finally, a spectre of Draco appears:

1) If an http client finds a file with a different MIME charset to its PI, then there
has been some dumb processing going on, and the file must be regarded as 
suspect, and therefore killed. 
2) This is really a problem of maintaining and verifying the integrity of data 
across uncontrolled systems. So XML files are binary, not text.  
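
Pseudo-Draco's first rule, sketched under the assumption that both labels
have already been extracted as strings. The exception and function names are
hypothetical. Python's `codecs.lookup` is used only to fold aliases such as
"UTF8" and "utf-8" together, so a label Python does not know raises
`LookupError` rather than the mismatch error.

```python
import codecs


class EncodingMismatch(ValueError):
    """The MIME charset and the encoding PI name different encodings."""


def check_labels(mime_charset, pi_encoding):
    """Return the agreed canonical encoding name, or raise on disagreement."""
    mime_name = codecs.lookup(mime_charset).name
    pi_name = codecs.lookup(pi_encoding).name
    if mime_name != pi_name:
        # Some dumb processing has happened: treat the file as suspect.
        raise EncodingMismatch(
            "MIME says %r but the PI says %r" % (mime_charset, pi_encoding)
        )
    return mime_name
```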

Rick Jelliffe

Received on Monday, 9 June 1997 12:25:51 UTC