- From: Jukka K. Korpela <jkorpela@cs.tut.fi>
- Date: Tue, 7 Aug 2007 13:56:27 +0300 (EEST)
- To: Cristina Fiorentini <c.fiorentini@comune.fe.it>
- cc: www-validator@w3.org
On Tue, 7 Aug 2007, Cristina Fiorentini wrote:

> Ok scuse me,
> the address of one of my documents is
> http://ww4.comune.fe.it/scuole/index.phtml?id=259 and the current validator,
> from some days, does not report error for word " ' " apostrophe.

Cristina, thank you for the information. I'm taking the liberty of Cc'ing the validator list, since you seem to have encountered a problem in the current version of the validator. It's probably not a bug but might need some clarification in the documentation.

> I declare my pages XHTML 1.0 Strict - iso-8859-1

I can reproduce the problem in a trivial test document
http://www.cs.tut.fi/~jkorpela/test/test.htmlx
that contains octet 146 (decimal), which is not reported by the W3C validator but is reported by the WDG validator. If I test with HTML 4.01, such an octet is reported as an error, as before. The problem appears both for XHTML 1.0 documents served as text/html and for them served as application/xhtml+xml.

I'm afraid this takes us deep into character problems. And I'm not sure I understand the issue well enough (even though I _should_; I've devoted several pages to the discussion of characters in markup languages in my book "Unicode Explained"...). But this is how things seem to be:

When you have octet 146 in a document declared to be iso-8859-1 encoded, it is interpreted as denoting a control code in the C1 Controls area. The meanings of those control codes have not been defined in the ISO 8859-1 standard, but they correspond to the C1 Controls area of Unicode, so that e.g. 146 decimal (92 hexadecimal) maps to the Unicode character U+0092.

Such characters (code positions) are forbidden in HTML 4.01 (or any pre-XHTML version of HTML), so the validator correctly reports them as erroneous ("non SGML character").

However, in XML, and hence in XHTML, C1 Controls like U+0092 are allowed, though discouraged. Formally, thus, they cannot be reported as errors.

> Can i declare my pages encoding as windows-1252?

Yes. (This changes the picture, since now e.g. octet 146 is interpreted according to the windows-1252 encoding, where it denotes a printable character.)

> It's not a problem for accessibility?

Hardly. Windows-1252 is widely supported by browsers, even on platforms other than Windows, simply because it is widely used on web pages (though often with declarations that claim that the encoding is iso-8859-1).

--
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/
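
As an illustration of the octet 146 discussion above, here is a minimal Python sketch (not part of the validator, and not taken from the original message) showing how the same octet decodes under the two encoding declarations. The helper names is_c1_control and is_xml10_char are hypothetical; the second is assumed to follow the Char production of XML 1.0.

```python
# Octet 146 decimal (92 hexadecimal), the byte discussed in the message.
octet = bytes([146])

# Declared as iso-8859-1: the octet maps to the C1 control U+0092.
as_latin1 = octet.decode("iso-8859-1")

# Declared as windows-1252: the same octet is a printable character,
# U+2019 RIGHT SINGLE QUOTATION MARK (the "apostrophe").
as_cp1252 = octet.decode("windows-1252")


def is_c1_control(ch):
    """True if ch lies in the C1 Controls area, U+0080..U+009F."""
    return 0x80 <= ord(ch) <= 0x9F


def is_xml10_char(ch):
    """True if ch matches the Char production of XML 1.0 (assumed here)."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


for label, ch in (("iso-8859-1", as_latin1), ("windows-1252", as_cp1252)):
    print(label, "U+%04X" % ord(ch),
          "C1 control:", is_c1_control(ch),
          "allowed in XML 1.0:", is_xml10_char(ch))
```

Under these assumptions both results are formally allowed in XML, but only the iso-8859-1 reading yields a C1 control, which is the case that HTML 4.01 validation rejects.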
Received on Tuesday, 7 August 2007 10:56:42 UTC