Re: non sgml characters from olivier Thereaux on 2007-08-08 (www-validator@w3.org from August 2007)

From: olivier Thereaux <ot@w3.org>
Date: Wed, 8 Aug 2007 15:36:42 +0900
To: Jukka K. Korpela <jkorpela@cs.tut.fi>
Cc: Cristina Fiorentini <c.fiorentini@comune.fe.it>, www-validator@w3.org
Message-Id: <BBC889F1-95F5-4139-A25E-DB46A3E4A8A9@w3.org>

On Aug 7, 2007, at 19:56, Jukka K. Korpela wrote:
> When you have octet 146 in a document declared to be iso-8859-1  
> encoded, it is interpreted as denoting a control code in the C1  
> Controls area. The meanings of those control codes have not been  
> defined in the ISO 8859-1 standard, but they correspond to the C1  
> Controls area of Unicode, so that e.g. 146 decimal (92 hexadecimal)  
> maps to the Unicode character U+0092.
> Such characters (code positions) are forbidden in HTML 4.01 (or any  
> pre-XHTML version of HTML), so the validator correctly reports them  
> as erroneous ("non SGML character"). However, in XML, and hence in  
> XHTML, C1 Controls like U+0092 are allowed, though discouraged.  
> Formally, thus, they cannot be reported as errors.

Exactly. This was a hairy issue (and still is to me; I can't claim to
fully grasp it yet), which I think we settled in
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3164
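
For illustration, here is a minimal Python sketch of the distinction Jukka
describes above. It is not the validator's actual code. The function names
are hypothetical, the "non SGML" ranges are my reading of the UNUSED entries
in the HTML 4.01 SGML declaration, and the XML check follows the Char
production of XML 1.0.

```python
# Code point ranges treated as UNUSED (non-SGML) for HTML 4.01.
# Assumption: transcribed from the HTML 4.01 SGML declaration's DESCSET.
NON_SGML_RANGES = [
    (0, 8),            # C0 controls before TAB
    (11, 12),          # VT, FF
    (14, 31),          # remaining C0 controls
    (127, 159),        # DEL and the C1 Controls area
    (0xD800, 0xDFFF),  # surrogates
]


def is_non_sgml_in_html4(cp: int) -> bool:
    """True if the code point would be a 'non SGML character' in HTML 4.01."""
    return any(lo <= cp <= hi for lo, hi in NON_SGML_RANGES)


def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Octet 146 (0x92) in a document declared as iso-8859-1.
octet = bytes([146])
cp = ord(octet.decode("iso-8859-1"))

print(f"U+{cp:04X}")              # U+0092
print(is_non_sgml_in_html4(cp))   # True: an error in HTML 4.01
print(is_xml10_char(cp))          # True: allowed (though discouraged) in XML
```

The two predicates disagree for U+0092, which is exactly why the same octet
is an error in an HTML 4.01 document but cannot formally be reported as one
in XHTML.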

-- 
olivier

Received on Wednesday, 8 August 2007 06:36:01 UTC