Re: faq suggestions

Hello Tex,

At 00:06 04/08/23 -0700, Tex Texin wrote:

>Konnichiwa Martin,
>
>1) I wrote my last mail as you wrote yours and the supporting statement was in
>that message.
>
>http://www.w3.org/TR/html401/charset.html#h-5.2.2
>
>"The META declaration must only be used when the character encoding is
>organized such that ASCII-valued bytes stand for ASCII characters (at least
>until the META element is parsed). META declarations should appear as early as
>possible in the HEAD element."
>
>If the document was going to be reparsed there would be less need for
>only ASCII-values to precede it.

There is still quite a strong need for that. Imagine Shift_JIS
or iso-2022-jp. Neither is ASCII-compatible in the sense you have
defined (which is exactly the "ASCII-valued bytes stand for ASCII
characters" in the text above). A parser can get completely out of
sync if, e.g., the <title> is in Shift_JIS and the <meta> comes after
the <title>.
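
To make this concrete, here is a small Python sketch (illustration only;
the helper name ascii_valued_bytes and the sample character are my own
choices, not anything from the spec). It counts how many bytes of an
encoded non-ASCII character fall in the ASCII range, which is what can
throw a parser off if it treats those bytes as markup or as ASCII text:

```python
# Illustration only: helper name and sample text are assumptions.

def ascii_valued_bytes(text: str, encoding: str) -> int:
    """Count bytes below 0x80 in the encoded form of `text`."""
    return sum(1 for b in text.encode(encoding) if b < 0x80)

# A single non-ASCII character, so ideally no byte would look like ASCII.
sample = "表"

for enc in ("euc_jp", "shift_jis", "iso2022_jp"):
    data = sample.encode(enc)
    # Print the encoding, the raw bytes in hex, and the ASCII-valued count.
    print(enc, data.hex(), ascii_valued_bytes(sample, enc))
```

With EUC-JP the count is zero, with Shift_JIS the trail byte of this
character is 0x5C (the value of the ASCII backslash), and with
iso-2022-jp every byte, including the escape sequences, is below 0x80.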


>2) I don't follow your logic:
> > To take the above EUC-JP example, EUC-JP is ASCII-compatible as you
> > have defined. A <title> with Japanese text should not appear before
> > the <meta>, but such a case is not forbidden. And in that case,
> > the <title> has to be interpreted as EUC-JP; I don't see any
> > way to read the spec differently.
>
>Yes EUC-JP is ASCII-compatible. (Somewhat irrelevant though. The term was
>brought up to clarify Jungshik's remarks.)

The term and the definition are relevant because they appear in the spec
(see above).


>However, if the User Agent has made some presumption of the encoding due to the
>lack of an http charset declaration, then the title would be interpreted in
>that encoding. I don't see why the paragraph you excerpted requires it to be
>interpreted as euc-jp.
>(But it would be nice.)

See my mail. There are two instances at least where it says that the
<meta> says what the encoding of the *document* is. The document, not
just the part after the <meta>. This is extremely clear. There is
absolutely nothing in the HTML 4 spec (at least as far as I know,
and I was pretty involved in the relevant parts) that would even
suggest that a document can use more than one encoding.
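
As a rough sketch of that reading (not a description of what any
particular user agent does), the Python below scans the raw bytes for
the <meta> charset and then decodes the whole document with it, so a
<title> that precedes the <meta> is still read as EUC-JP. The helper
decode_document, the simplified regular expression, and the iso-8859-1
fallback are all assumptions made for illustration:

```python
import re

# Simplified, hypothetical prescan: look for charset=... inside a <meta> tag.
# It works on raw bytes and relies on the markup itself being ASCII-valued.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def decode_document(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode the *whole* document using the charset named in its <meta>."""
    match = META_CHARSET.search(raw)
    encoding = match.group(1).decode("ascii") if match else fallback
    return raw.decode(encoding)

title = "日本語のタイトル"
raw = (
    b"<html><head><title>" + title.encode("euc_jp") + b"</title>\n"
    b'<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">\n'
    b"</head><body></body></html>"
)

# The title comes before the <meta>, yet is recovered intact.
print(title in decode_document(raw))
# Decoding with a presumed default instead garbles the title.
print(title in raw.decode("iso-8859-1"))
```

The prescan only works here because EUC-JP keeps the ASCII-valued bytes
for ASCII characters, which is the condition the quoted paragraph of the
spec puts on using <meta> at all.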

Regards,     Martin.


>tex
>
>
>Martin Duerst wrote:
> >
> > Hello Tex,
> >
> > At 19:36 04/08/22 -0700, Tex Texin wrote:
> >
> > >Hi Jungshik,
> > >
> > >With respect to user agents reparsing documents from the beginning, can
> > >you say which ones do this?
> > >They are not obligated to and the wording of the standards implies that the
> > >encoding "switch" from the initial value to the value specified in the charset
> > >statement, occurs at the point the statement is parsed.
> >
> > Can you point to some place that supports that statement?
> >
> > At http://www.w3.org/TR/html401/charset.html#h-5.2.2, I find:
> >
> >  > To address server or configuration limitations, HTML documents may
> >  > include explicit information about the document's character encoding;
> >  > the META element can be used to provide user agents with this information.
> >
> > This says "the document's character encoding", nothing about points
> > after.
> >
> >  > For example, to specify that the character encoding of the current
> >  > document
> >
> > This again says "character encoding of the current *document*".
> >
> >  > is "EUC-JP", a document should include the following META
> >  > declaration:
> >  >
> >  > <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
> >  >
> >  > The META declaration must only be used when the character encoding is
> >  > organized such that ASCII-valued bytes stand for ASCII characters (at
> >  > least until the META element is parsed). META declarations should appear
> >  > as early as possible in the HEAD element.
> >
> > To take the above EUC-JP example, EUC-JP is ASCII-compatible as you
> > have defined. A <title> with Japanese text should not appear before
> > the <meta>, but such a case is not forbidden. And in that case,
> > the <title> has to be interpreted as EUC-JP; I don't see any
> > way to read the spec differently.
> >
> > Regards,    Martin.
> >
> > >On a separate point I wonder if you meant ASCII-compatible or simply ASCII.
> > >If the text prior to the charset statement consists of only ASCII characters,
> > >then yes, the later position of the charset statement is moot. But if the
> > >statements preceding the charset statement contain non-ASCII characters in an
> > >ASCII-compatible encoding, if the user agent doesn't reparse from the
> > >beginning, then it may misinterpret the content of those statements.
> > >
> > >(To clarify, to me an ASCII-compatible encoding is one that assigns the same
> > >characters as the ASCII character set does to the values 0-127, and then
> > >assigns additional characters to values greater than 127.)
> > >
> > >tex
> > >
> > >Jungshik Shin wrote:
> > >
> > > > Tex Texin wrote:
> > > > > Otherwise text in the page prior to the charset statement may not be decoded
> > > > > correctly.
> > > >
> > > > However, as long as the encoding used is ASCII-compatible, it doesn't
> > > > matter much. I believe most user 'agents' look for 'meta' declaration
> > > > for charset and reparse the document from the beginning after
> > > > determining the encoding (assuming http C-T header doesn't have charset
> > > > parameter)
> > > >
> > > > Jungshik
> > >
> > >--
> > >-------------------------------------------------------------
> > >Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
> > >Xen Master                          http://www.i18nGuy.com
> > >
> > >XenCraft                            http://www.XenCraft.com
> > >Making e-Business Work Around the World
> > >-------------------------------------------------------------
>

Received on Monday, 23 August 2004 09:29:53 UTC