
Re: Erroneous 'unescaped &' warning message from CGI urls

From: Rick Jelliffe <ricko@gate.sinica.edu.tw>
Date: Tue, 20 Feb 2001 19:24:19 +0800 (CST)
To: html-tidy <html-tidy@w3.org>
Message-ID: <Pine.GSO.4.21.0102201852590.17019-100000@gate>
On Tue, 20 Feb 2001, Martyn J Shaw wrote:

> I've seen this argument before somewhere.  Is the full answer that
> the HTML 4.01 spec says that the href attribute takes a URI as a value,
> that a URI is of type CDATA, and that a user agent should replace
> character entities in CDATA sections?

It is useful to be able to put character references in attribute
values, because it helps multilingual and non-English HTML.

The SGML (reference concrete syntax, RCS) and XML rules are that in any
attribute value which should contain the character "&" followed by a
legitimate SGML (RCS) name start character (a-zA-Z; "." and "-" are name
characters but cannot start a name), the "&" must be marked up to prevent
false delimiter recognition.

XML goes further than default (RCS) SGML, in that in XML _all_ occurrences
of "&" must be entered by reference (the entity reference &amp; or a
numeric character reference such as &#38;).
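
For the CGI URL case in the subject line, the XML rule can be sketched in a
few lines of Python. This is only an illustration, not anything Tidy itself
does: the helper name href_attribute and the example.org URL are made up,
and html.escape is the standard-library function used to do the escaping.

```python
from html import escape


def href_attribute(url: str) -> str:
    """Build an href="..." attribute, escaping markup-significant characters.

    Hypothetical helper for illustration only. html.escape turns every "&"
    into "&amp;" (and, with quote=True, also escapes quote characters), so
    the result is safe under both the SGML and the stricter XML rule.
    """
    return 'href="' + escape(url, quote=True) + '"'


# Made-up CGI URL with two query parameters separated by a bare "&".
cgi_url = "http://example.org/cgi-bin/search?name=tidy&lang=en"
print(href_attribute(cgi_url))
```

The browser still requests the URL with a plain "&": the reference is
resolved when the attribute value is parsed, before the URI is used.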

HTML is supposed to follow SGML rules: normally there is no problem
because attribute values are often keywords or numbers which don't use
"&". However, HTML parsers may do almost anything with a stray "&", which
is why XML is so strict in its requirements. From an SGML perspective, it
is probably best to say that HTML processors have a particular
error-recovery strategy for handling spurious references: they transmit
the text of the references as is. (I think HTML is as much an
error-recovery strategy as it is a particular language :-) However, this
approach is not suitable for XML (and so XHTML), because in XML, parsing
is not element-dependent but is separate from the DTD.
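
The difference in strictness can be seen with Python's standard-library
parsers, used here only as stand-ins for "an HTML processor" and "an XML
processor". This is a rough sketch: the class and function names are
invented for the example, and other HTML parsers may recover differently.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HrefCollector(HTMLParser):
    """Collect href attribute values as a lenient HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href":
                self.hrefs.append(value)


def html_hrefs(markup: str) -> list:
    """Parse markup as HTML and return the href values that were seen."""
    collector = HrefCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


def xml_accepts(markup: str) -> bool:
    """Return True if the markup is accepted by an XML parser."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


raw = '<a href="search.cgi?a=1&b=2">go</a>'          # bare "&"
escaped = '<a href="search.cgi?a=1&amp;b=2">go</a>'  # "&" entered as &amp;

print(html_hrefs(raw), html_hrefs(escaped))    # what the HTML parser recovers
print(xml_accepts(raw), xml_accepts(escaped))  # whether XML tolerates each form
```

The HTML parser passes the spurious "&b" through as text, so both forms
yield the same URL; the XML parser rejects the unescaped one outright.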

So it is not really anything to do with the attribute being CDATA or not.

Actually, the terminology of CDATA is confusing for people: I have heard
Charles Goldfarb (inventor of SGML and generalized markup) say it would be
better to rename CDATA sections and CDATA elements "CLEARTEXT",
because they don't recognise any markup (other than the close
delimiters as appropriate), while CDATA attributes must have
references recognised (as all attributes must).  (In SGML, but not XML, we
have a declared content type for elements, RCDATA, which allows references
but not comments, PIs, or start-tags. CDATA attributes act like RCDATA
elements...)

Hope this is of some use.

Cheers
Rick Jelliffe
Received on Tuesday, 20 February 2001 06:24:26 GMT
