Re: faq suggestions from Tex Texin on 2004-08-23 (www-international@w3.org from July to September 2004)

From: Tex Texin <tex@i18nguy.com>
Date: Mon, 23 Aug 2004 03:10:32 -0700
To: Martin Duerst <duerst@w3.org>
CC: Jungshik Shin <jshin@i18nl10n.com>, www-international@w3.org
Message-ID: <4129C298.E1AB07E9@i18nguy.com>
Martin Duerst wrote:
> >If the document was going to be reparsed there would be less need for
> >only ASCII-values to precede it.
> 
> There is still quite a strong need for that. Imagine Shift_JIS,
> or iso-2022-jp. Both are not ASCII-compatible in the sense you have
> defined (which is exactly the "ASCII-valued bytes stand for ASCII
> characters" in the text above). A parser can get completely out of
> sync if e.g. the <title> is in Shift_JIS and the <meta> comes after
> the <title>.

I agree a parser can get out of sync.
I agree that restricting the statements preceding the charset declaration to be
ASCII makes sense.
I agree that the spec never says the encoding changes in a document.
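
To make the out-of-sync case concrete, here is a minimal Python sketch (not from the spec or from Martin's mail). It tests whether an encoding keeps ASCII-valued bytes reserved for ASCII characters, for one sample character. The helper name ascii_bytes_stay_ascii is hypothetical, and the check covers only the characters passed in, so it is an illustration rather than a proof about a whole encoding.

```python
def ascii_bytes_stay_ascii(text: str, encoding: str) -> bool:
    """Return True if no non-ASCII character in `text` encodes to a byte below 0x80.

    Hypothetical helper for illustration; it only examines the given characters.
    """
    for ch in text:
        if ord(ch) < 0x80:
            continue  # ASCII characters are not the concern here
        if any(b < 0x80 for b in ch.encode(encoding)):
            return False  # an ASCII-valued byte is part of a non-ASCII character
    return True


# U+8868 is the byte pair 0x95 0x5C in Shift_JIS; 0x5C is the ASCII backslash value.
sample = "\u8868"
for enc in ("euc_jp", "shift_jis", "iso2022_jp"):
    print(enc, ascii_bytes_stay_ascii(sample, enc), sample.encode(enc))
```

For this sample, EUC-JP uses only bytes of 0x80 and above, while Shift_JIS and iso-2022-jp both produce ASCII-valued bytes that are not ASCII characters, which is what can mislead a parser scanning ahead for the <meta>.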

However, I don't see anything in the spec for the case you are arguing, where
the preceding statements are not ASCII and therefore should be reparsed.

Actually, upon rereading, this sentence:
"The META declaration must only be used when the character encoding is
organized such that ASCII-valued bytes stand for ASCII characters (at least
until the META element is parsed)."

could be taken to mean that if you need or intend to use non-ASCII data
before the meta charset declaration, you should not use a meta declaration
at all and should instead use HTTP ("The META declaration MUST ONLY be
used..."), which would also eliminate the need to reparse.
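
As a rough illustration of that reading, the Python sketch below scans the raw bytes for a meta charset under the assumption that ASCII-valued bytes stand for ASCII characters, lets an HTTP-supplied charset take precedence, and then decodes the whole document once with the result. The names sniff_meta_charset and decode_document, the 1024-byte scan limit, and the iso-8859-1 fallback are all assumptions made for the example and are not taken from the spec or from any user agent.

```python
import re

# Byte-level pattern; only meaningful if ASCII-valued bytes really are ASCII
# characters up to the end of the META element (the condition quoted above).
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)", re.IGNORECASE
)


def sniff_meta_charset(raw: bytes, limit: int = 1024):
    """Return the charset named in a META declaration, or None (hypothetical helper)."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii") if match else None


def decode_document(raw: bytes, http_charset=None, fallback="iso-8859-1"):
    """Decode the whole document with one encoding: HTTP first, then META, then a fallback."""
    encoding = http_charset or sniff_meta_charset(raw) or fallback
    return raw.decode(encoding), encoding


# A Japanese <title> placed before the <meta>, encoded as EUC-JP.
page = (
    "<html><head><title>\u65e5\u672c\u8a9e</title>"
    '<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
    "</head><body></body></html>"
).encode("euc_jp")

text, used = decode_document(page)
print(used, "\u65e5\u672c\u8a9e" in text)
```

Because EUC-JP keeps its non-ASCII characters in high bytes, the byte-level scan finds the declaration even though the title precedes it. With Shift_JIS or iso-2022-jp the same scan has no such guarantee, and passing http_charset sidesteps the scan entirely.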

 
> >2) I don't follow your logic:
> > > To take the above EUC-JP example, EUC-JP is ASCII-compatible as you
> > > have defined. A <title> with Japanese text should not appear before
> > > the <meta>, but such a case is not forbidden. And in that case,
> > > the <title> has to be interpreted as EUC-JP; I don't see any
> > > way to read the spec differently.
> >
> >Yes EUC-JP is ASCII-compatible. (Somewhat irrelevant though. The term was
> >brought up to clarify Jungshik's remarks.)
> 
> The term and the definition are relevant because they appear in the spec
> (see above).

The spec does not use "ASCII-compatible". It uses "ASCII" and "ASCII-valued".

 
> See my mail. There are two instances at least where it says that the
> <meta> says what the encoding of the *document* is. The document, not
> just the part after the <meta>. This is extremely clear. There is
> absolutely nothing in the HTML 4 spec (at least as far as I know,
> and I was pretty involved in the relevant parts) that would even
> suggest that a document can use more than one encoding.

I agree the spec only discusses a "document encoding". I disagree that it is
clear that the document needs to be reparsed.
FWIW, we have had several discussions in the i18n WG about the issue of not
knowing the encoding until the meta statement is parsed, and this is the first
time I have heard of reparsing from the beginning.

I imagine that if it were clear, then Microsoft (assuming Bjoern's assertion is
true) would reparse from the beginning rather than just the current chunk.
Also, if it were clear that UAs would reparse from the beginning, the advice to
move the meta statement to the top would be quite cogent, since it would now be
a performance issue (which people would pay attention to).
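
The difference between the two strategies can be sketched in Python as follows. This is a toy model and not a description of any actual browser: the function decode_chunks, the two-chunk split, and the iso-8859-1 default are hypothetical, and the chunks are assumed to break on character boundaries.

```python
def decode_chunks(chunks, declared, meta_chunk_index,
                  default="iso-8859-1", from_start=True):
    """Decode a chunked document once the META charset is known (toy model).

    from_start=True  : every chunk is decoded again with the declared encoding.
    from_start=False : chunks before the one holding the META keep the default.
    """
    parts = []
    for index, chunk in enumerate(chunks):
        use_declared = from_start or index >= meta_chunk_index
        parts.append(chunk.decode(declared if use_declared else default))
    return "".join(parts)


# Chunk 0 holds a Japanese <title>; chunk 1 holds the <meta> and the body.
chunks = [
    "<title>\u65e5\u672c\u8a9e</title>".encode("euc_jp"),
    ('<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
     "<p>\u65e5\u672c\u8a9e</p>").encode("euc_jp"),
]

whole = decode_chunks(chunks, "euc_jp", meta_chunk_index=1, from_start=True)
partial = decode_chunks(chunks, "euc_jp", meta_chunk_index=1, from_start=False)
print(whole.startswith("<title>\u65e5\u672c\u8a9e</title>"))
print(partial.startswith("<title>\u65e5\u672c\u8a9e</title>"))
```

In this model the chunk-only strategy leaves the title decoded with the default encoding, so only the text from the <meta> onward comes out as intended, while reparsing from the start decodes every chunk a second time.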


However, we are doing quite a bit of hair-splitting here and I think the
original question was answered a while ago...
tex
Received on Monday, 23 August 2004 10:11:40 UTC