- From: Gavin Nicol <gtn@eps.inso.com>
- Date: Mon, 16 Jun 1997 08:10:10 -0400
- To: w3c-sgml-wg@w3.org
>We should use BOM and encoding declarations only. If a >document entity or an external entity does not have BOM or >an encoding declaration, it is in UTF-8. Period. This is quite repulsive. It seems to me that your group of experts knows little, or nothing, about network protocol design, and especially HTTP (which is quite obviously going to be the primary protocol use for XML delivery in the short term). >Other information or heuristics such as "Metadata provided >by the native OS file system or by document management >software" (4.3.3, Part 1) should not be used. Why not? >Encoding inheritance should not be introduced. I'm glad to see that you have changed your stance on this. This has far too many failure cases to be useful. >If HTTP or MIME headers provide encoding information, it >should be identical to the encoding specified in the >transmitted document (possibly implicitly by the XML default, >which is UTF-8). If not identical, the system is in error. How is a caching HTTP proxy server supposed to determine if this is the case without parsing the XML header? >Is this agreeable? I think this is very clear. This >is not always very convenient, but nobody or no systems >will be confused. I think that if you want to make a concrete proposal, you should provide some rationale. My concrete proposal is: 1) If external metadata (for example: HTTP protocol headers, file system information, metadata stored in a database) specifying the encoding is available, this should be given priority. It is a reportable error for the XML declaration to differ from the encoding specified by meta-data, but need not be fatal. 2) If external meta-data is unavailable, auto-detection, as specified in Appendex E. is to be used to determine the character encoding of the document. My rationale for (1) is that in open systems, and especially in the face of proxies, transforming/transcoding servers etc. the only (supposedly) reliable mechanism you have is the protocol. Without this, caching etc. all fall to peices. For (2), my rationale is that this is a reasonable fallback strategy. At a personal level, I am deeply troubled with requiring *any* encoding to be supported, and for the need for the XML declaration. Given that we do require some encodings, I think we should restrict ourselves to UTF-8.
Received on Monday, 16 June 1997 08:10:56 UTC