Re: Character Encoding Problem from Jukka K. Korpela on 2005-03-22 (www-validator@w3.org from March 2005)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Tue, 22 Mar 2005 22:40:03 +0200 (EET)
To: Michel CARRARE <mc@michelcarrare.com>
Cc: www-validator@w3.org
Message-ID: <Pine.GSO.4.58.0503222226020.29126@korppi.cs.tut.fi>

On Tue, 22 Mar 2005, Michel CARRARE wrote:

> I have a little problem with character encoding. One of my web pages:
>
> 	http://www.michelcarrare.com/multimedia/table-car.php
>
> contains a table of all 8-bit characters.

It contains incorrect information. There is a rich supply of tables
of "8-bit characters", some of them correct, some not. I wouldn't
mention this (after all, we all try to reinvent the wheel at times),
but it is directly connected with the validation problems.

> When validating this page, I have
> warnings corresponding to reserved characters, which is absolutely normal.

No, the warnings are about character references like &#128;, which are
technically _undefined_ (not reserved). And the warnings are indeed
useful. Here they imply that the page contains bogus information.
Whatever gets rendered when you use &#128; is just error processing by a
browser.

> Here is my problem. I thought only characters from 128 to 159 were
> reserved.

They are not reserved. And character encoding is not the issue here.
The reference &#128; is undefined, no matter what the encoding is.
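
To illustrate why the encoding is beside the point, here is a small
Python sketch. It assumes (this is a guess, not something stated on the
page) that the table was meant to show the windows-1252 repertoire, where
byte 0x80 is the euro sign. A numeric reference names a code position in
the document character set (ISO 10646), never a byte value in some
encoding, so the reference has to be built from the decoded character:

```python
# Assumption for illustration: the page author meant the character that
# windows-1252 stores as byte 0x80 (the euro sign).
byte_value = 0x80

# Decode the byte to find out which character it actually is.
char = bytes([byte_value]).decode("cp1252")

# The numeric character reference uses the ISO 10646 code position,
# which is the same regardless of how the document is encoded.
code_position = ord(char)
reference = f"&#{code_position};"

print(reference)  # &#8364;
```

Under that assumption, the defined way to write the character is
&#8364;, whatever encoding the page is served in.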

> But, apparently, the validator sends me warnings for characters
> from 127 to 159. Could anyone tell me if character 127 is reserved or not?
> I could not find this information. I mean, I found both answers!

The authoritative answer is in the SGML declaration for HTML 4.01:

         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 160     55136   160
                 55296   2048    UNUSED  -- SURROGATES --
                 57344   1056768 57344

  http://www.w3.org/TR/html4/sgml/sgmldecl.html

Thus, code position 127 is UNUSED in the document character set (which
does _not_ depend on the character encoding you use), and hence &#127;
is undefined too.
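
The table can be read mechanically. The following Python sketch is only
an illustration: the names DESCSET and is_defined are made up here, and
the rows are transcribed from the declaration quoted above as
(start, count, defined), with UNUSED mapped to False.

```python
# Rows transcribed from the DESCSET in the HTML 4.01 SGML declaration.
# Each row: (first code position, number of positions, defined?).
DESCSET = [
    (0, 9, False),
    (9, 2, True),
    (11, 2, False),
    (13, 1, True),
    (14, 18, False),
    (32, 95, True),
    (127, 1, False),
    (128, 32, False),
    (160, 55136, True),
    (55296, 2048, False),    # surrogates
    (57344, 1056768, True),
]


def is_defined(code_position: int) -> bool:
    """Return True if &#code_position; names a defined character."""
    for start, count, defined in DESCSET:
        if start <= code_position < start + count:
            return defined
    # Positions not described by any row are not in the character set.
    return False


for n in (126, 127, 128, 159, 160):
    print(n, is_defined(n))
```

This reports 126 and 160 as defined and 127, 128 and 159 as undefined,
which matches the range the validator warns about.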

What puzzles me is this: When I tried to validate your page using the
extended interface (to get the source listed with line numbers),
http://validator.w3.org/check?uri=http%3A%2F%2Fwww.michelcarrare.com%2Fmultimedia%2Ftable-car.php&charset=%28detect+automatically%29&doctype=%28detect+automatically%29&ss=1&verbose=1
I got just "This Page Is Valid HTML 4.01 Transitional!" with no warnings!
Apparently this interface switches off the warnings, and there isn't even
any obvious way to switch them back on there.

(Clarification: The page is valid, i.e. it does not contain any reportable
markup error, but it is still seriously wrong. Using an undefined
character reference is wrong, especially on the Web. It's like
using 0/0 in mathematics: it is a syntactically correct expression
but lacks a defined meaning, and anything may happen.)

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Tuesday, 22 March 2005 20:40:37 UTC