ampersands, angle brackets, errors, warnings and xml

From: Marc Richards <contact_marcos@yahoo.es>
Date: Mon, 11 Jul 2005 11:28:15 -0400
Message-ID: <42D2900F.70207@yahoo.es>
To: www-validator@w3.org

Hi,

I have been doing some research about the need for CDATA sections in 
xhtml pages and I have run into a couple of areas that are a bit unclear 
so I was hoping some of the validator folks might have some answers.

I am already aware of certain w3c validator issues like warnings being 
accidentally suppressed[1] and the fact that the validator "has some 
limitations" with regard to XML.  In general I have done all my 
testing[2] against the development version of the validator[3].

So here are my questions:

1) Should the validator be throwing an error instead of a warning 
whenever it encounters an ampersand or left angle bracket as data for a 
document served as application/xhtml+xml? i.e. was there a conscious 
decision made to only throw a warning, or is this simply one of the XML 
parser limitations?

If this *is* one of the XML limitations then I think it would be helpful 
to compile a short list of common limitations and list them on a w3c 
page in plain English.  I have read the OpenSP page[4] a couple times 
and I am still not sure whether or not recognizing "<" and "&" as 
invalid is a limitation of the parser; the language on that page is 
fairly technical.  The validator could link to this internal page 
directly and that page would then link to the OpenSP page as well.

If this isn't a parser limitation, is there a bug number open for 
recognizing these two characters as errors?  I would like to add myself 
to the CC list.


2) Why are you issuing a warning for the use of ampersands and left angle 
brackets in xhtml but not html?  If the warning is in fact saying "this 
may be valid in some contexts, but it is recommended to use &amp; or 
&lt;" then this is an SGML warning and should be shown for both HTML and 
XHTML as text/html.  Ideally with an example like "R & D valid, R&D 
invalid".  Is there a bug open for issuing the warning for html doctypes 
as well?
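
To make the "use &amp; or &lt;" recommendation concrete, here is a 
minimal sketch of the replacement using Python's standard library. The 
helper name escape_text is my own and purely illustrative; it has 
nothing to do with how the validator itself is implemented.

```python
from xml.sax.saxutils import escape


def escape_text(text):
    """Return text with &, < and > replaced by &amp;, &lt; and &gt;."""
    # escape() handles the ampersand first, so an existing "&lt;" in the
    # input would become "&amp;lt;". Only call this on raw, unescaped data.
    return escape(text)


if __name__ == "__main__":
    for raw in ("R & D", "R&D", "a < b"):
        print(repr(raw), "->", repr(escape_text(raw)))
```

The escaped form is safe in all three cases discussed here (HTML4, 
XHTML1 as HTML, XHTML1 as XML), with or without surrounding spaces.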


Here is my current understanding of the validity of various 
doctype/content type combinations, along with how I think the validator 
should react. The cases should correspond to my validation test cases[2].


ampersand and left bracket inside the body as data.
---------------------------------------------------
HTML4 - warning: you are a few whitespaces away from an invalid page
XHTML1 as HTML - warning: you are a few whitespaces away from an 
invalid page
XHTML1 as XML - error: this is xml, fool!


ampersand and left bracket inside the body as data. oops, no spaces.
--------------------------------------------------------------------
HTML4 - error: unrecognized entity and unrecognized tag
XHTML1 as HTML - error: unrecognized entity and unrecognized tag
XHTML1 as XML - error: unrecognized entity and unrecognized tag
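
The "XHTML1 as XML" rows in the two tables above can be checked with 
any conforming XML parser. The sketch below uses Python's bundled 
expat-based parser, which is an assumption on my part and not the 
parser the validator uses. It only tests XML well-formedness, so it 
says nothing about the HTML4 or text/html rows, which depend on SGML 
rules. The function name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if markup parses as well-formed XML, else False."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Small fragments standing in for the body content of the test pages.
SAMPLES = {
    "bare, with spaces": "<p>R & D, a < b</p>",
    "bare, no spaces": "<p>R&D, a<b</p>",
    "escaped": "<p>R &amp; D, a &lt; b</p>",
}

if __name__ == "__main__":
    for label, markup in SAMPLES.items():
        print(label, "->", is_well_formed(markup))
```

With this parser, whitespace around the character makes no difference: 
both bare samples are rejected and only the escaped one parses.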


ampersand and left bracket inside the script tag as data.
---------------------------------------------------------
HTML4 - no problem: the script tag is CDATA in the HTML4 DTD
XHTML1 as HTML - warning: you are a few whitespaces away from an invalid 
page. the script tag is PCDATA in the XHTML1 DTD
XHTML1 as XML - error: this is xml, fool!


ampersand and left bracket inside the script tag as data. oops, no spaces.
--------------------------------------------------------------------
HTML4 - no problem: the script tag is CDATA in the HTML4 DTD
XHTML1 as HTML - error: unrecognized entity and unrecognized tag
XHTML1 as XML - error: unrecognized entity and unrecognized tag


ampersand and left bracket inside the script tag as data w/ the CDATA tag
----------------------------------------------------------------------
HTML4 - no problem: no harm in a redundant CDATA section
XHTML1 as HTML - no problem: CDATA section to the rescue
XHTML1 as XML - no problem: CDATA section to the rescue


ampersand and left bracket inside the script tag as data w/ the CDATA 
tag. oops, no spaces.
------------------------------------------------------------------------
HTML4 - no problem: no harm in a redundant CDATA section
XHTML1 as HTML - no problem: CDATA section to the rescue
XHTML1 as XML - no problem: CDATA section to the rescue
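
For the "XHTML1 as XML" rows of the script cases, the effect of the 
CDATA section can be sketched the same way. Again this uses Python's 
bundled XML parser as a stand-in (my assumption, not the validator's 
parser) and checks well-formedness only. The script body and the name 
script_text are made up for illustration.

```python
import xml.etree.ElementTree as ET


def script_text(markup):
    """Return the element's text if markup is well-formed XML, else None."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None


# Hypothetical script body containing a bare "<" and "&&", no spaces
# around the "<".
BODY = "if (a<b && c) { go(); }"

BARE = "<script>" + BODY + "</script>"
WRAPPED = "<script><![CDATA[" + BODY + "]]></script>"

if __name__ == "__main__":
    print("bare:   ", script_text(BARE))
    print("wrapped:", script_text(WRAPPED))
```

The bare version fails to parse, while the wrapped version parses and 
hands the script body back unchanged.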


Are these scenarios correct/ideal? Are there open bugs you can point me 
to? Are there bugs I should file?

Thanks.


Marc


[1] http://www.w3.org/Bugs/Public/show_bug.cgi?id=798
[2] http://mulberry.swarthmore.edu/validation-tests/
[3] http://validator.w3.org:8001/
[4] http://openjade.sourceforge.net/doc/xml.htm
Received on Monday, 11 July 2005 15:28:19 GMT
