- From: Tex Texin <tex@i18nguy.com>
- Date: Mon, 23 Aug 2004 03:10:32 -0700
- To: Martin Duerst <duerst@w3.org>
- CC: Jungshik Shin <jshin@i18nl10n.com>, www-international@w3.org
Martin Duerst wrote:

> >If the document was going to be reparsed there would be less need for
> >only ASCII-values to precede it.
>
> There is still quite a strong need for that. Imagine Shift_JIS,
> or iso-2022-jp. Both are not ASCII-compatible in the sense you have
> defined (which is exactly the "ASCII-valued bytes stand for ASCII
> characters" in the text above). A parser can get completely out of
> sync if e.g. the <title> is in Shift_JIS and the <meta> comes after
> the <title>.

I agree a parser can get out of sync.
I agree that restricting the statements preceding the charset
declaration to be ASCII makes sense.
I agree that the spec never says the encoding changes in a document.

However, I don't see anything in the spec for the case you are arguing,
where the preceding statements are not ASCII and therefore should be
reparsed.

Actually, upon rereading, this sentence:

"The META declaration must only be used when the character encoding is
organized such that ASCII-valued bytes stand for ASCII characters (at
least until the META element is parsed)."

could be taken to mean that if you need or intend to use non-ASCII data
before the meta charset declaration, you should not use a meta
declaration at all and should instead use HTTP. ("The META declaration
MUST ONLY be used...") That would also eliminate the need to reparse.

> >2) I don't follow your logic:
> > > To take the above EUC-JP example, EUC-JP is ASCII-compatible as you
> > > have defined. A <title> with Japanese text should not appear before
> > > the <meta>, but such a case is not forbidden. And in that case,
> > > the <title> has to be interpreted as EUC-JP; I don't see any
> > > way to read the spec differently.
> >
> >Yes EUC-JP is ASCII-compatible. (Somewhat irrelevant though. The term was
> >brought up to clarify Jungshik's remarks.)
>
> The term and the definition are relevant because they appear in the spec
> (see above).

The spec does not use "ASCII-compatible". It uses "ASCII" and
"ASCII-valued".

> See my mail. There are two instances at least where it says that the
> <meta> says what the encoding of the *document* is. The document, not
> just the part after the <meta>. This is extremely clear. There is
> absolutely nothing in the HTML 4 spec (at least as far as I know,
> and I was pretty involved in the relevant parts) that would even
> suggest that a document can use more than one encoding.

I agree the spec only discusses a "document encoding".
I disagree that it is clear that the document needs to be reparsed.

FWIW, we have had several discussions in the i18n wg about the issue of
not knowing the encoding until the meta statement is parsed, and this is
the first time I have heard of reparsing from the beginning.

I imagine that if it were clear, then Microsoft (assuming Bjoern's
assertion is true) would not reparse just the current chunk and would
instead reparse from the beginning.

Also, if it were clear that UAs would reparse from the beginning, the
advice to move the meta statement to the top would be quite cogent,
since it would then be a performance issue (which people would pay
attention to).

However, we are doing quite a bit of hair-splitting here and I think the
original question was answered a while ago...

tex
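
To illustrate the Shift_JIS point quoted above, here is a minimal Python
sketch. It assumes Python's built-in shift_jis and euc_jp codecs; the
helper name and the sample character are hypothetical, chosen only for
illustration. It lists the ASCII-valued bytes (below 0x80) that appear
when a non-ASCII character is encoded, which is the situation where
"ASCII-valued bytes stand for ASCII characters" no longer holds.

```python
def ascii_valued_bytes_in_non_ascii(text, encoding):
    """Collect bytes below 0x80 that come from encoding non-ASCII characters.

    Hypothetical helper for illustration; not taken from the HTML 4 spec.
    """
    found = []
    for ch in text:
        if ord(ch) < 0x80:
            continue  # a genuine ASCII character, not what we are looking for
        for b in ch.encode(encoding):
            if b < 0x80:
                found.append(b)
    return found


sample = "表"  # hypothetical <title> content

# Shift_JIS encodes this character as 0x95 0x5C; the trail byte 0x5C is
# ASCII-valued (it is the byte used for backslash in ASCII).
print(ascii_valued_bytes_in_non_ascii(sample, "shift_jis"))  # [92]

# EUC-JP uses only bytes >= 0x80 for the same character.
print(ascii_valued_bytes_in_non_ascii(sample, "euc_jp"))  # []
```

A parser that scans for the <meta> byte by byte, treating every
ASCII-valued byte as an ASCII character, could therefore misread
Shift_JIS text that precedes the declaration, while the same scan over
EUC-JP text would not hit that problem.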
Received on Monday, 23 August 2004 10:11:40 UTC