- From: Martin Duerst <duerst@w3.org>
- Date: Mon, 23 Aug 2004 16:48:00 +0900
- To: Tex Texin <tex@i18nguy.com>
- Cc: Jungshik Shin <jshin@i18nl10n.com>, www-international@w3.org
Hello Tex,

At 00:06 04/08/23 -0700, Tex Texin wrote:
>Konnichiwa Martin,
>
>1) I wrote my last mail as you wrote yours and the supporting statement was in
>that message.
>
>http://www.w3.org/TR/html401/charset.html#h-5.2.2
>
>"The META declaration must only be used when the character encoding is
>organized such that ASCII-valued bytes stand for ASCII characters (at least
>until the META element is parsed). META declarations should appear as early as
>possible in the HEAD element."
>
>If the document was going to be reparsed there would be less need for
>only ASCII-values to precede it.

There is still quite a strong need for that. Imagine Shift_JIS,
or iso-2022-jp. Both are not ASCII-compatible in the sense you
have defined (which is exactly the "ASCII-valued bytes stand for
ASCII characters" in the text above). A parser can get completely
out of sync if e.g. the <title> is in Shift_JIS and the <meta>
comes after the <title>.

>2) I don't follow your logic:
> > To take the above EUC-JP example, EUC-JP is ASCII-compatible as you
> > have defined. A <title> with Japanese text should not appear before
> > the <meta>, but such a case is not forbidden. And in that case,
> > the <title> has to be interpreted as EUC-JP; I don't see any
> > way to read the spec differently.
>
>Yes EUC-JP is ASCII-compatible. (Somewhat irrelevant though. The term was
>brought up to clarify Jungshik's remarks.)

The term and the definition are relevant because they appear
in the spec (see above).

>However, if the User Agent has made some presumption of the encoding due to the
>lack of an http charset declaration, then the title would be interpreted in
>that encoding. I don't see why the paragraph you excerpted requires it to be
>interpreted as euc-jp.
>(But it would be nice.)

See my mail. There are two instances at least where it says that
the <meta> says what the encoding of the *document* is. The document,
not just the part after the <meta>. This is extremely clear.
There is absolutely nothing in the HTML 4 spec (at least as far as
I know, and I was pretty involved in the relevant parts) that would
even suggest that a document can use more than one encoding.

Regards,    Martin.

>tex
>
>
>Martin Duerst wrote:
> >
> > Hello Tex,
> >
> > At 19:36 04/08/22 -0700, Tex Texin wrote:
> >
> > >Hi Jungshik,
> > >
> > >With respect to user agents reparsing documents from the beginning, can
> > >you say which ones do this?
> > >They are not obligated to and the wording of the standards implies that the
> > >encoding "switch" from the initial value to the value specified in the charset
> > >statement, occurs at the point the statement is parsed.
> >
> > Can you point to some place that supports that statement?
> >
> > At http://www.w3.org/TR/html401/charset.html#h-5.2.2, I find:
> >
> > > To address server or configuration limitations, HTML documents may
> > > include explicit information about the document's character encoding;
> > > the META element can be used to provide user agents with this information.
> >
> > This says "the document's character encoding", nothing about points
> > after.
> >
> > > For example, to specify that the character encoding of the current
> > > document
> >
> > This again says "character encoding of the current *document*".
> >
> > > is "EUC-JP", a document should include the following META
> > > declaration:
> > >
> > > <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
> > >
> > > The META declaration must only be used when the character encoding is
> > > organized such that ASCII-valued bytes stand for ASCII characters (at
> > > least until the META element is parsed). META declarations should appear
> > > as early as possible in the HEAD element.
> >
> > To take the above EUC-JP example, EUC-JP is ASCII-compatible as you
> > have defined. A <title> with Japanese text should not appear before
> > the <meta>, but such a case is not forbidden.
> > And in that case,
> > the <title> has to be interpreted as EUC-JP; I don't see any
> > way to read the spec differently.
> >
> > Regards,    Martin.
> >
> > >On a separate point I wonder if you meant ASCII-compatible or simply ASCII.
> > >If the text prior to the charset statement consists of only ASCII characters,
> > >then yes, the later position of the charset statement is moot. But if the
> > >statements preceding the charset statement contain non-ASCII characters in an
> > >ASCII-compatible encoding, if the user agent doesn't reparse from the
> > >beginning, then it may misinterpret the content of those statements.
> > >
> > >(To clarify, to me an ASCII-compatible encoding is one that assigns the same
> > >characters as the ASCII character set does to the values 0-127, and then
> > >assigns additional characters to values greater than 127.)
> > >
> > >tex
> > >
> > >Jungshik Shin wrote:
> > > >
> > > > Tex Texin wrote:
> > > > > Otherwise text in the page prior to the charset statement may not be
> > > > > decoded correctly.
> > > >
> > > > However, as long as the encoding used is ASCII-compatible, it doesn't
> > > > matter much.
> > > > I believe most user 'agents' look for 'meta' declaration
> > > > for charset and reparse the document from the beginning after
> > > > determining the encoding (assuming http C-T header doesn't have charset
> > > > parameter)
> > > >
> > > > Jungshik
> > >
> > >--
> > >-------------------------------------------------------------
> > >Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
> > >Xen Master                          http://www.i18nGuy.com
> > >
> > >XenCraft                            http://www.XenCraft.com
> > >Making e-Business Work Around the World
> > >-------------------------------------------------------------
>
>--
>-------------------------------------------------------------
>Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
>Xen Master                          http://www.i18nGuy.com
>
>XenCraft                            http://www.XenCraft.com
>Making e-Business Work Around the World
>-------------------------------------------------------------
Received on Monday, 23 August 2004 09:29:53 UTC
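
The two technical points in the message above can be illustrated with a small sketch. It is not taken from the thread or from any user agent. It assumes Python's built-in codecs, and the helper names (ascii_valued_bytes, decode_whole_document), the sample character, and the iso-8859-1 fallback are all hypothetical choices made for illustration. The first helper shows why EUC-JP satisfies "ASCII-valued bytes stand for ASCII characters" while Shift_JIS and iso-2022-jp do not. The second shows the reading argued for in the message: find the META declaration, then decode the whole document, including a <title> that precedes the <meta>, in that one encoding.

```python
import re

# Sample non-ASCII character (U+8868). Its Shift_JIS form is the byte pair
# 0x95 0x5C, and 0x5C is also the ASCII value of a backslash.
SAMPLE = "表"


def ascii_valued_bytes(text: str, encoding: str) -> bytes:
    """Return the bytes below 0x80 produced when `text` is encoded."""
    return bytes(b for b in text.encode(encoding) if b < 0x80)


# Simplified pattern for illustration only; real user agents apply more
# elaborate rules when looking for the declaration.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset=["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def decode_whole_document(data: bytes, fallback: str = "iso-8859-1"):
    """Prescan the raw bytes for a META charset, then decode from byte 0.

    The prescan only works if ASCII-valued bytes really are ASCII characters
    up to the META element. `fallback` is an assumption of this sketch, not
    something the HTML 4 text prescribes.
    """
    match = META_CHARSET.search(data)
    charset = match.group(1).decode("ascii") if match else fallback
    return charset, data.decode(charset)


if __name__ == "__main__":
    # Which encodings emit ASCII-valued bytes for a non-ASCII character?
    for enc in ("euc_jp", "shift_jis", "iso2022_jp"):
        print(enc, ascii_valued_bytes(SAMPLE, enc))

    # A Japanese <title> placed before the <META>, encoded as EUC-JP.
    doc = (
        "<html><head><title>日本語</title>"
        '<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">'
        "</head><body></body></html>"
    ).encode("euc_jp")
    charset, text = decode_whole_document(doc)
    print(charset, "日本語" in text)
```

With EUC-JP the encoded sample contains no bytes below 0x80, so a byte-level scan for the META element cannot be confused by the Japanese title. With Shift_JIS a trail byte falls in the ASCII range, and with iso-2022-jp every byte does, which is the situation where a parser working on the ASCII assumption can lose sync.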