Re: Problem in publishing multilingual HTML document on web in UTF-8 encoding

Philip TAYLOR wrote:

> This talks specifically about "ASCII-valued bytes", and says nothing at
> all about non-ASCII-valued bytes.

I went and read the section. It says:

> The META <http://www.w3.org/TR/html4/struct/global.html#edef-META> 
> declaration must only be used when the character encoding is organized 
> such that ASCII-valued bytes stand for ASCII characters (at least 
> until the META 
> <http://www.w3.org/TR/html4/struct/global.html#edef-META> element is 
> parsed).

So let's see. All encodings that are completely incompatible with ASCII, 
such as all EBCDIC variants, are out. You can't use the meta with them.
Now an interesting question arises. UTF-16 is mostly organized 
such that ASCII-valued bytes stand for ASCII characters. ('A' is x0041, 
and the x41 byte is indeed the ASCII value of 'A'.) How about the 
x00 bytes though? They don't stand for the ASCII NUL character. Instead, 
each of them is one half of the code unit for some other character.
My intuition tells me that this means that UTF-16 and all similar 
encodings are out and Ian's example is simply invalid. I'm not sure 
here, though.
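
To make the EBCDIC and UTF-16 cases concrete, here is a small Python 
sketch. It is my own illustration and not anything from the spec: the 
helper name reads_as_ascii is made up, and cp037 is just one EBCDIC 
variant I picked. It asks whether the bytes of "<meta" would still read 
as "<meta" to a parser that assumes ASCII.

```python
def reads_as_ascii(text: str, encoding: str) -> bool:
    """Return True if `text`, encoded with `encoding`, still reads as the
    same text when its bytes are naively interpreted as ASCII.

    Hypothetical helper for illustration only.
    """
    raw = text.encode(encoding)
    # errors="replace" turns bytes >= 0x80 into U+FFFD instead of raising.
    return raw.decode("ascii", errors="replace") == text


for enc in ("utf-8", "iso-8859-1", "utf-16-be", "cp037"):
    raw = "<meta".encode(enc)
    print(f"{enc:11} {raw.hex(' '):30} ascii-readable: {reads_as_ascii('<meta', enc)}")
```

UTF-16 fails this check only because of the interleaved x00 bytes, which 
is exactly the doubtful case above.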

But the real point of discussion was whether the character encoding can 
change. It cannot. There is only one encoding per document. Although the 
text does not state this explicitly, the wording builds on this 
assumption. For example, from 5.2.2:

> How does a server determine which character encoding applies for a 
> document it serves?

Note the use of the singular.
The practical problem here is, what signals the change of the encoding? 
Is it the end of the meta element, or the end of the content attribute 
of the meta start tag? Since no such thing is specified, we can safely 
assume that the character encoding cannot change during the document.
That does not mean the mapping of bytes to characters cannot change.
Non-ASCII bytes may appear in the stream prior to the meta (I was wrong 
here. A sensible implementation would be to store them for later 
translation), but ASCII bytes must have the ASCII meaning. This is what 
the phrasing of the text means. The part in the parentheses is about 
stateful encodings such as ISO-2022-JP, which in the initial state map 
ASCII bytes to ASCII characters, but after an escape sequence map the 
same bytes differently. The phrase in the parentheses permits such 
encodings, but no such switch may come before the meta.
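
A short Python sketch of the stateful case, using ISO-2022-JP as the 
example (my choice of encoding and of sample text, not something the spec 
names). Every byte in the stream is ASCII-valued, and the part before the 
first escape sequence really is ASCII, but the bytes after it no longer 
stand for ASCII characters.

```python
# "<meta>" followed by two kanji (U+65E5 U+672C); sample text is arbitrary.
text = "<meta>\u65e5\u672c"

raw = text.encode("iso2022_jp")

# The stream starts in the ASCII state, so the markup is plain ASCII bytes.
prefix_is_ascii = raw.startswith(b"<meta>")

# ISO-2022-JP is a 7-bit encoding: every byte is ASCII-valued.
all_ascii_valued = all(b < 0x80 for b in raw)

# Reading the whole stream as ASCII "works" but yields the wrong text,
# because the bytes after the escape sequence mean something else.
naive = raw.decode("ascii")

print(raw)
print("prefix is ASCII:     ", prefix_is_ascii)
print("all bytes < 0x80:    ", all_ascii_valued)
print("naive ASCII == text: ", naive == text)
print("proper decode == text:", raw.decode("iso2022_jp") == text)
```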

Sebastian Redl

Received on Saturday, 3 June 2006 15:33:56 UTC