- From: Dany Talisman <dani@capella.co.il>
- Date: Mon, 09 Feb 1998 23:11:06 +0300
- To: www-html-editor@w3.org
There is some lack of clarity, and what seems to be a real problem, in the mechanism described in Section 5.2.2 of the current HTML 4.0 specification for using the META tag to specify a character set. I couldn't resolve the following from the text as it currently stands:

1. What character set is to be used to locate the META tag? There really should be a requirement that it always be in some standard encoding; otherwise tools would be hard-pressed to parse the META tag itself. There is a hint to that effect, saying that "The META declaration must only be used when the character encoding is such that ASCII characters stand for themselves at least until the META element is parsed." However, this does not provide a general solution for the case when the encoding is not compatible with ASCII (e.g., simple 16-bit Unicode). How can a tool safely locate the META tag in such a case? Not all HTML documents are accessed via an HTTP server; some are read directly from the disk, which rules out any "external databases" that could resolve this problem. I assume that there should be a standard way of reading such files.

2. At exactly which point does the encoding switch specified by a META tag take effect? Immediately at the end of the META tag, or at the end of the HEAD section? Given that the HEAD section may need to contain text which is not easily expressed in ASCII, I presume the intent is that the change is made immediately after the '>' ending the META tag. This is hinted at by the recommendation that "META declarations should appear as early as possible", but there's no explicit wording to that effect.

3. Suppose it is valid to use the META tag to change encoding at some point in the text. Suppose it is also valid for the whole document to be in any encoding, without a META tag, provided that the HTTP server correctly reports its encoding in the header.
This seems to be the current state of the standard, as hinted by the fact that "Some servers examine the first few bytes of the document, or check against a database of known files and encodings" to determine encoding. For documents of the first type, the HTTP server is expected to report in the header the encoding specified in the META tag. However, the text up to the META tag is encoded in ASCII. How is a user agent expected to parse this text, if the encoding is not compatible with ASCII? As far as the agent is concerned, such files look the same as files which use this encoding from the first character, without a META tag.

I tried thinking of ways this could be handled under the current standard or with minimal changes. All except the last suffer from severe compatibility problems with current servers:

- Have the HTTP server encode the text up to the META tag using the specified encoding. This raises another problem: if you read an HTML file from the disk, you get a different result than when reading it from an HTTP server. Even worse, current servers will not do the encoding while new ones will, with the result that you'll never know exactly what to expect.

- Have the HTTP server send a new header entry, which indicates that the specified character set is only used from some point onward. This would require the user agent to search for the META tag to determine this switching point. It still faces the compatibility problem with existing servers, which would not send this data.

- Require that the character set specified in the HTTP header be used from the first character. The META tag would cause a switch from this initial character set to a new one. This would again require user agents to look for it. It would also require servers to make a special case for the character set META tag, and not place it in the header; again a compatibility problem with current servers.
- Likewise (use the HTTP header character set from the first character, switching later on), but instead of using a META tag which affects the HTTP header, use one which doesn't. The only disadvantage I could find with this method is that it does not agree with the current standard :-) IMVHO, this is the right choice, combined with a strong recommendation that all files start with an ASCII encoding up to this tag, so as to be usable directly as well as through HTTP servers.

Quite likely I'm missing something here, in which case a simple clarification of the intent in Section 5.2.2 would solve the problem. Sorry if this is covered somewhere else; I tried looking in other sections (the META tag definition, for example), but none seemed to give a definite answer.

Thanks,

Oren Ben-Kiki
oren@capella.co.il
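
P.S. To make question 1 concrete, here is a minimal Python sketch of a tool that scans the raw bytes for an ASCII "<meta". It is only an illustration: the function names, the sample markup, and the charset labels are hypothetical and not taken from the specification. It assumes a tool that has no out-of-band encoding information, as when reading a file from disk.

```python
# Illustrative sketch only; names and sample markup are assumptions.
META = '<META http-equiv="Content-Type" content="text/html; charset={}">'


def encode_document(codec: str, label: str) -> bytes:
    """Encode a tiny document, META tag included, using the given codec."""
    text = "<HTML><HEAD>" + META.format(label) + "</HEAD></HTML>"
    return text.encode(codec)


def ascii_scan_finds_meta(data: bytes) -> bool:
    """Naive scan: look for the ASCII bytes of '<meta' in the raw stream."""
    return b"<meta" in data.lower()


latin1 = encode_document("iso-8859-1", "ISO-8859-1")
utf16 = encode_document("utf-16-be", "UTF-16BE")

print(ascii_scan_finds_meta(latin1))  # True: ASCII characters stand for themselves
print(ascii_scan_finds_meta(utf16))   # False: every character occupies two bytes
```

In the 16-bit case a zero byte sits between each pair of ASCII letters, so a byte-level search never sees the tag unless the tool already knows the encoding.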
Received on Monday, 9 February 1998 16:11:37 UTC