
Non-SGML Char Refs

From: Thanasis Kinias <tkinias@asu.edu>
Date: Mon, 04 Jun 2001 13:37:35 -0700
To: "'www-validator@w3.org'" <www-validator@w3.org>
Cc: "'tkinias@optimalco.com'" <tkinias@optimalco.com>
Message-id: <A021872EC2BDD411AB3600902746A055016048D6@mainex4.asu.edu>
Greetings,

The validator complains about "non-SGML character" references (e.g., &#147;
instead of the correct &#8220;) only when validating as XHTML.  That implies
that &#147; and the other Microsoft characters from decimal 128-159 (hex
80-9f) _are_ valid in HTML.
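
To illustrate the &#147; versus &#8220; correspondence, here is a minimal
Python sketch that rewrites decimal references in the 128-159 range to the
Unicode code points of the matching Windows-1252 characters.  It assumes the
author of the page meant Windows-1252, which is only a guess about intent;
the name fix_cp1252_refs is hypothetical and not part of any validator.

```python
import re

# Matches decimal numeric character references such as &#147;
NUMERIC_REF = re.compile(r"&#(\d+);")


def fix_cp1252_refs(markup):
    """Rewrite &#128;..&#159; as references to the intended Unicode characters.

    Assumption: references in this range were meant as Windows-1252 bytes.
    """
    def repl(match):
        code = int(match.group(1))
        if 128 <= code <= 159:
            try:
                char = bytes([code]).decode("cp1252")
            except UnicodeDecodeError:
                # A few positions (e.g. 129) are undefined in Windows-1252;
                # leave those references untouched.
                return match.group(0)
            return "&#%d;" % ord(char)
        return match.group(0)

    return NUMERIC_REF.sub(repl, markup)


print(fix_cp1252_refs("&#147;quoted&#148;"))
```

References outside 128-159 pass through unchanged, so the function can be
run over a whole page.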

However, the HTML 4.01 spec [1] reads:

	Numeric character references specify the code position of a character
	in the document character set.
	[...]
	The syntax "&#D;", where D is a decimal number, refers to the ISO 10646
	decimal character number D.

The characters from decimal 128-159 are non-printing controls in UCS/Unicode.
From the SGML declaration of HTML4.01 [2]:

	CHARSET
         BASESET  "ISO Registration Number 177//CHARSET
                   ISO/IEC 10646-1:1993 UCS-4 with
                   implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0       9       UNUSED
                 9       2       9
                 11      2       UNUSED
                 13      1       13
                 14      18      UNUSED
                 32      95      32
                 127     1       UNUSED
                 128     32      UNUSED
                 160     55136   160
                 55296   2048    UNUSED  -- SURROGATES --
                 57344   1056768 57344

As I read that, it means that the 32 characters starting at decimal 128
(i.e., 128-159) are UNUSED.  So the validator should flag an error on
character references like &#147; in HTML4 as well as in XHTML.
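
The reading of the DESCSET above can be sketched as a small Python check.
This is only an illustration of the table lookup, not how the W3C or WDG
validator is implemented; the names DESCSET, is_sgml_char and non_sgml_refs
are hypothetical, and the rows are copied from the declaration quoted above
with UNUSED represented as None.

```python
import re

# (start, count, mapping) rows from the DESCSET; None stands for UNUSED.
DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 55136, 160),
    (55296, 2048, None),   # surrogates
    (57344, 1056768, 57344),
]

# Decimal (&#147;) or hexadecimal (&#x93;) numeric character references.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|(\d+));")


def is_sgml_char(code):
    """Return True if the code position is mapped (not UNUSED) in DESCSET."""
    for start, count, mapping in DESCSET:
        if start <= code < start + count:
            return mapping is not None
    return False  # beyond the last described range


def non_sgml_refs(markup):
    """List the numeric character references that point at UNUSED positions."""
    bad = []
    for match in NUMERIC_REF.finditer(markup):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_sgml_char(code):
            bad.append(match.group(0))
    return bad


print(non_sgml_refs("&#147;text&#148; and &#8220;text&#8221;"))
```

Under this reading, &#147; and &#148; are reported while &#8220; and &#8221;
are accepted, regardless of whether the document is HTML4 or XHTML.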

For an example of a page which uses such invalid references under HTML4 
Transitional, but where the refs are not flagged as invalid, see [3].
The WDG validator [4], BTW, does catch this.

1. <http://www.w3.org/TR/html4/charset.html#h-5.3.1>
2. <http://www.w3.org/TR/html4/sgml/sgmldecl.html>
3. <http://my.asu.edu/>
4. <http://www.htmlhelp.com/tools/validator/>

Regards,

Thanasis Kinias
Information Dissemination Team, Information Technology
Arizona State University
Tempe, Ariz., U.S.A.

Qui nos rodunt confundantur
et cum iustis non scribantur.
Received on Monday, 4 June 2001 16:38:39 GMT
