Re: Unencoded Ampersands from Jukka K. Korpela on 2004-09-03 (www-validator@w3.org from September 2004)

From: Jukka K. Korpela <jkorpela@cs.tut.fi>
Date: Fri, 3 Sep 2004 23:58:56 +0300 (EEST)
To: Charl van Niekerk <charlvn@gmail.com>
Cc: www-validator@w3.org
Message-ID: <Pine.GSO.4.58.0409032343190.25253@korppi.cs.tut.fi>

On Thu, 2 Sep 2004, Charl van Niekerk wrote:

> I thought having unencoded ampersands is illegal in XML.

As far as I can see, they are indeed disallowed, even by the
well-formedness rules. The definition of CharData (which in effect
specifies what is allowed outside tags, character references, and entity
references) at http://www.w3.org/TR/REC-xml/#NT-CharData says:
CharData    ::=    [^<&]* - ([^<&]* ']]>' [^<&]*)
thereby excluding "<" and "&", in accordance with prose descriptions in
the XML specification.
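
A minimal sketch of how this plays out in practice, assuming Python's
standard-library xml.etree.ElementTree (an expat-based, non-validating
parser). The helper name is_well_formed is hypothetical and used only
for illustration:

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as well-formed XML, else False."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # The parser reports a well-formedness violation as an error
        # instead of recovering from it.
        return False
    return True


# A naked "&" or "<" in character data violates the CharData production.
print(is_well_formed("<p>R & D</p>"))      # unescaped ampersand
print(is_well_formed("<p>a < b</p>"))      # unescaped less-than sign
print(is_well_formed("<p>R &amp; D</p>"))  # escaped ampersand
```

The first two documents should be rejected and the third accepted, since
only the third keeps "&" and "<" out of character data.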

> However, the
> validator only lists them under "warnings". How so? Shouldn't any
> conforming XML parser crash on those?

Well, not crash. :-) But even non-validating processors are required to
check well-formedness, and a violation of it is a fatal error that the
processor must detect and report.

My guess is that the markup validator has been built upon a generic SGML
validator, just with some tuning, which doesn't cover this issue. The
recognition of a naked "&" was apparently added ad hoc, and it was
probably easier to make it issue a warning than an error.

> Also, I thought unencoded ampersands is illegal in HTML too.

No, they aren't, since SGML rules apply. An ampersand need not be escaped
(though it has always been good practice in HTML to escape it), except
when it could otherwise start an entity reference or a character
reference. Thus, "R&D" is incorrect (&D must be parsed as an entity
reference, and the entity is undefined), whereas "R & D" is formally OK.
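
The distinction can be sketched in code. This is a simplified
approximation, not a full SGML parser: it assumes that "&" starts an
entity reference only when immediately followed by a letter, and that
"&#" starts a character reference only when immediately followed by a
letter or digit. The function name reference_starts is hypothetical:

```python
import re

# "&" followed by a letter (possible entity reference), or "&#" followed
# by a letter or digit (possible character reference). Simplified rule.
_REFERENCE_START = re.compile(r"&(?:[A-Za-z]|#[0-9A-Za-z])")


def reference_starts(text: str) -> list[int]:
    """Return offsets of ampersands that would be parsed as reference starts."""
    return [m.start() for m in _REFERENCE_START.finditer(text)]


print(reference_starts("R&D"))    # the "&" is followed by a letter
print(reference_starts("R & D"))  # the "&" is followed by a space
```

Under these assumptions the ampersand in "R&D" is flagged, because "&D"
would be read as a reference to an undefined entity, while the one in
"R & D" is not. An ampersand that is flagged is not necessarily an error:
"&amp;" is flagged too, and it is a valid reference.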

-- 
Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/

Received on Friday, 3 September 2004 20:59:29 UTC