Re: Determination of Encoding (Re: Invasion of the pseudo-people ...) from Gavin Nicol on 1997-06-16 (w3c-sgml-wg@w3.org from June 1997)

From: Gavin Nicol <gtn@eps.inso.com>
Date: Mon, 16 Jun 1997 08:10:10 -0400
To: w3c-sgml-wg@w3.org
Message-Id: <199706161210.IAA27573@nathaniel.eps.inso.com>

>We should use BOM and encoding declarations only.  If a
>document entity or an external entity does not have BOM or 
>an encoding declaration, it is in UTF-8.  Period.

This is quite repulsive. It seems to me that your group of experts
knows little, or nothing, about network protocol design, and especially
HTTP (which is quite obviously going to be the primary protocol use
for XML delivery in the short term).

>Other information or heuristics such as "Metadata provided 
>by the native OS file system or by document management 
>software" (4.3.3, Part 1) should not be used.   

Why not?

>Encoding inheritance should not be introduced.

I'm glad to see that you have changed your stance on this. This has
far too many failure cases to be useful.

>If HTTP or MIME headers provide encoding information, it 
>should be identical to the encoding specified in the 
>transmitted document (possibly implicitly by the XML default, 
>which is UTF-8).  If not identical, the system is in error.

How is a caching  HTTP proxy server supposed to determine if this is
the case without parsing the XML header?

>Is this agreeable?  I think this is very clear.  This 
>is not always very convenient, but nobody or no systems 
>will be confused.

I think that if you want to make a concrete proposal, you should
provide some rationale.

My concrete proposal is:

1) If external metadata (for example: HTTP protocol headers, file
   system information, metadata stored in a database) specifying the
   encoding is available, this should be given priority. It is a
   reportable error for the XML declaration to differ from the
   encoding specified by meta-data, but need not be fatal.
2) If external meta-data is unavailable, auto-detection, as specified
   in Appendex E. is to be used to determine the character encoding of
   the document.

My rationale for (1) is that in open systems, and especially in the
face of proxies, transforming/transcoding servers etc. the only
(supposedly) reliable mechanism you have is the protocol. Without
this, caching etc. all fall to peices.

For (2), my rationale is that this is a reasonable fallback strategy. 

At a personal level, I am deeply troubled with requiring *any*
encoding to be supported, and for the need for the XML
declaration. Given that we do require some encodings, I think we
should restrict ourselves to UTF-8.

Received on Monday, 16 June 1997 08:10:56 UTC