- From: MURATA Makoto <murata@hokkaido.email.ne.jp>
- Date: Mon, 21 Apr 2003 08:43:09 +0900
- To: www-tag@w3.org
- Cc: simonstl@simonstl.com, dan@dankohn.com, duerst@w3.org, Murata <murata@hokkaido.email.ne.jp>, ishida@w3.org
Chris Lilley wrote:
> The following language is proposed for section "4.2. Processing model"
> of the Web Architecture document. It also includes a lead-in sentence
> to be added to the overall Section "4. Representations" introductory
> material.
> http://www.w3.org/2001/tag/webarch/

I propose to omit the second para of proposed text.

For application/xml and non-text "+xml" media types, charset parameters must not be produced unless they give the same information as the xml encoding declaration would give.

Here is the reason for my proposal.

Ideally, there should be either 1) a single in-band encoding declaration mechanism for all textual formats or 2) a single out-band encoding declaration mechanism for all protocols, but not both.

Everybody knows the current situation is far from this ideal. HTML, CSS, and XML have different mechanisms for in-band encoding declarations. Most file systems and some old protocols lack out-band encoding declaration mechanisms. Different specs and implementations have different precedence rules for the same format. Many programs rely on charset sniffing.

This unfortunate situation certainly has to be improved. However, I do not understand in which direction the TAG is heading. Since the TAG is about the technical architecture of the Web, I would request that the TAG give a general direction rather than making a final decision on specific issues. Apparently, the I18N WG should be involved.

I have some experience with non-ASCII WWW pages consisting of XML, HTML, CSS, JSP, Servlets, CGI, XSLT, and Javascript. At present, there are many different mechanisms, and WWW developers have to use each of them appropriately. Since this is immensely difficult, IT magazines in Japan have often published articles about corrupted WWW pages. In my opinion, the difficulty is not caused by a particular mechanism but rather by inconsistencies among many mechanisms. However, the TAG has considered the XML media type issue only.
I would request that the TAG study all existing mechanisms and then give a general direction.

The encoding declaration issue certainly arises in non-XML formats. For example, when the RELAX NG TC of OASIS designed the compact syntax of RELAX NG, it wondered whether it should provide an in-band declaration mechanism. At present, the compact syntax does not have any such mechanism, since the RELAX NG TC felt uneasy about introducing yet another mechanism for in-band encoding declaration. I suppose that other compact formats (e.g., XQuery and probably RDF Schema) will have the same problem. I also know that many formats (e.g., Javascript and VBScript) rely on charset sniffing.

I wrote that there should ideally be a single mechanism for in-band encoding declaration. For this reason, I now think that XML declarations are hopelessly broken. They should have been

    @charset "ISO-8859-1";
    <?xml version="1.0"?>

rather than

    <?xml version="1.0" encoding="ISO-8859-1"?>

The former is in sync with CSS2 and correctly separates the octet sequence layer and the character sequence layer, while XML 1.0 confuses these two layers. If all textual formats (including Javascript, Perl, ruby, etc.) had adopted the same mechanism for in-band encoding declarations, the current situation would at least have been more consistent. (The in-band vs. out-band issue will not disappear, however.)

At present, the only general-purpose mechanism is the charset parameter, which is (vaguely) recommended by RFC 2130 and "Character Model for the World Wide Web 1.0". If W3C agrees that the charset parameter should be discouraged, it is probably a good idea to say so explicitly in a standards-track RFC, "Character Model for the World Wide Web 1.0", and the WWW architecture document of W3C.

For the record, the following paragraph quoted from http://lists.w3.org/Archives/Public/www-tag/2003Apr/0034.html is incorrect. When the charset parameter of application/xml is omitted, XML processors MUST use the BOM or encoding declarations.
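As a rough illustration of that fallback, here is a minimal Python sketch of how a receiver might pick an encoding for an application/xml body. The name xml_encoding and its parameters are hypothetical and come from no spec or library. The sketch assumes the XML declaration is readable as ASCII bytes, ignores UTF-32 and EBCDIC, and treats a charset parameter that is present as authoritative, which is my reading of RFC 3023 for application/xml.

```python
import codecs
import re
from typing import Optional

# Byte order marks checked before any declaration is read.
# UTF-32 BOMs are deliberately not handled in this sketch.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# Encoding pseudo-attribute of an XML declaration at the very start of
# the body. Assumption: the encoding is ASCII-compatible, so the
# declaration itself can be matched as plain bytes.
_DECL = re.compile(
    rb'^<\?xml[^>]*?\bencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def xml_encoding(body: bytes, charset: Optional[str] = None) -> str:
    """Return a lower-cased encoding name for an application/xml body."""
    if charset:
        # Out-band information is present: use it as given.
        return charset.lower()
    for bom, name in _BOMS:
        if body.startswith(bom):
            return name
    match = _DECL.match(body)
    if match:
        return match.group(1).decode("ascii").lower()
    # No charset parameter, no BOM, no encoding declaration.
    return "utf-8"
```

Calling xml_encoding(body) with no charset argument exercises only the in-band path (BOM, then encoding declaration, then the UTF-8 default). Passing charset shows how the out-band value overrides whatever the document says about itself.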
> In section "3.2. Application/xml Registration" RFC 3023 also lists,
> unusually, a charset parameter, declares it authoritative in all cases
> over HTTP even when not present, ...

[1] http://www.asahi-net.or.jp/~eb2m-mrt/charsetDetection.html

--
MURATA Makoto <murata@hokkaido.email.ne.jp>
Received on Sunday, 20 April 2003 20:06:59 UTC