Re: non sgml characters from Jukka K. Korpela on 2007-08-07 (www-validator@w3.org from August 2007)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Tue, 7 Aug 2007 13:56:27 +0300 (EEST)
To: Cristina Fiorentini <c.fiorentini@comune.fe.it>
cc: www-validator@w3.org
Message-ID: <Pine.SOC.4.64.0708071327290.16542@mustatilhi.cs.tut.fi>

On Tue, 7 Aug 2007, Cristina Fiorentini wrote:

> Ok, excuse me,
> the address of one of my documents is 
> http://ww4.comune.fe.it/scuole/index.phtml?id=259 and the current validator, 
> for some days now, does not report an error for the apostrophe character " ' ".

Cristina,

thank you for the information. I'm taking the liberty of Cc'ing the 
validator list, since you seem to have encountered a problem in the 
current version of the validator. It's probably not a bug but might need 
some clarification in the documentation.

> I declare my pages XHTML 1.0 Strict -  iso-8859-1

I can reproduce the problem in a trivial test document
http://www.cs.tut.fi/~jkorpela/test/test.htmlx
that contains octet 146 (decimal), which is not reported by the 
W3C validator but is reported by the WDG validator. If I test with HTML 
4.01, such an octet is reported as an error, as before.
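
For anyone who wants to check their own pages for such octets before 
validating, here is a minimal Python sketch. The function name 
find_c1_octets and the sample bytes are made up for illustration; the 
sample is not the actual content of the test document.

```python
def find_c1_octets(data: bytes):
    """Return (offset, value) pairs for every octet in the range 128..159."""
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

# Hypothetical sample: an apostrophe typed on Windows, stored as octet 146.
sample = b"<p>It" + bytes([146]) + b"s a test</p>"

hits = find_c1_octets(sample)
for offset, value in hits:
    print(f"offset {offset}: octet {value} decimal ({value:02X} hexadecimal)")
```

Any hit in a page declared as iso-8859-1 suggests the data is really 
windows-1252 encoded.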

The problem appears both for XHTML 1.0 documents served as text/html and 
for those served as application/xhtml+xml.

I'm afraid this takes us deep into character problems. And I'm not sure I 
understand the issue well enough (even though I _should_; I've devoted 
several pages to the discussion of characters in markup languages in my 
book "Unicode Explained"...). But this is how things seem to be:

When you have octet 146 in a document declared to be iso-8859-1 encoded, 
it is interpreted as denoting a control code in the C1 Controls area. The 
meanings of those control codes have not been defined in the ISO 8859-1 
standard, but they correspond to the C1 Controls area of Unicode, so that 
e.g. 146 decimal (92 hexadecimal) maps to the Unicode character U+0092.
Such characters (code positions) are forbidden in HTML 4.01 (or any 
pre-XHTML version of HTML), so the validator correctly reports them as 
erroneous ("non SGML character"). However, in XML, and hence in XHTML, C1 
Controls like U+0092 are allowed, though discouraged. Formally, therefore, 
the validator cannot report them as errors.
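
The difference can be sketched in Python. The two range checks below are 
my own transcription of the Char production in XML 1.0 and of the 
character set description in the HTML 4.01 SGML declaration, so treat 
them as an illustration under that assumption and not as what the 
validator actually runs; the function names are hypothetical.

```python
# Octet 146 read as ISO-8859-1 maps directly to the code point U+0092.
char = bytes([146]).decode("iso-8859-1")

def allowed_in_xml10(ch: str) -> bool:
    """XML 1.0 Char production: C1 Controls fall inside #x20-#xD7FF."""
    cp = ord(ch)
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def allowed_in_html401(ch: str) -> bool:
    """HTML 4.01 SGML declaration (as transcribed here): 127..159 are UNUSED."""
    cp = ord(ch)
    return (cp in (9, 10, 13)
            or 32 <= cp <= 126
            or 160 <= cp <= 55295
            or 57344 <= cp <= 1114111)

print(f"U+{ord(char):04X}",
      "XML 1.0:", allowed_in_xml10(char),
      "HTML 4.01:", allowed_in_html401(char))
```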

> Can i declare my pages encoding as windows-1252?

Yes. (This changes the picture, since now e.g. octet 146 is interpreted 
according to the windows-1252 encoding, where it denotes a printable 
character.)
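
A short Python sketch of that change of interpretation, using only the 
standard codecs and the unicodedata module (the variable names are just 
for illustration):

```python
import unicodedata

octet = bytes([146])

# Same octet, two declared encodings, two different characters.
as_latin1 = octet.decode("iso-8859-1")    # C1 control code U+0092
as_cp1252 = octet.decode("windows-1252")  # printable punctuation mark

for label, ch in (("iso-8859-1", as_latin1), ("windows-1252", as_cp1252)):
    name = unicodedata.name(ch, "(no character name)")
    print(f"{label}: U+{ord(ch):04X} {unicodedata.category(ch)} {name}")
```

Under windows-1252 the octet comes out as the curly apostrophe that word 
processors tend to insert, which is presumably what the page author meant.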

> It's not a problem for accessibility?

Hardly. Windows-1252 is widely supported by browsers, even on platforms 
other than Windows, simply because it is widely used on web pages
(though often with declarations that claim that the encoding is 
iso-8859-1).

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Tuesday, 7 August 2007 10:56:42 UTC