Re: xml parsing error for < character from Dave Raggett on 2000-09-01 (html-tidy@w3.org from July to September 2000)

From: Dave Raggett <dsr@w3.org>
Date: Fri, 1 Sep 2000 11:21:11 +0100 (GMT Daylight Time)
To: cateen <cateen@turboweb.net.au>
cc: html-tidy@w3.org, gerald@w3.org
Message-ID: <Pine.WNT.4.10.10009011056180.-405663@hazel.hpl.hp.com>

On Tue, 22 Aug 2000, cateen wrote:

> Having used JTidy to produce an xml file, I then use a SAX based
> XML parser on it and get the following exceptions: 'Whitespace
> required before attributes.' caused by the following line in a
> javascript function:  for (i=0;i<y;i++) OR 'The content begining
> "<" is not legal markup.  Perhaps the "" (&#20;) character
> should be a letter.' caused by :  for ( i = 0; i < y; i++)
> 
> Can anyone tell me why this is happening and if there is a
> configuration setting that will change the way the XML parser
> treats the '<' sign.  Alternatively is there a way that I can
> tell JTidy to simply strip out all the information between the
> <script> and </script> tags as it writes the xml file??

When I looked into this I discovered a bug in tidy.c lines 1075
and 1093 which should be testing XmlOut and not XmlTags. This
fix now causes < and & in script element content to be escaped.

This however doesn't solve the problem for XHTML. JavaScript uses
< and & both of which need to be escaped in XML content unless you
wrap the content in a CDATA marked section.

Unfortunately, current HTML browsers don't recognize CDATA marked
sections, and furthermore they expect script elements to have
CDATA content, and hence expect < and & to be unescaped.

The XHTML 1.0 standard therefore recommends you move your scripts
to external files. The onmouseover and other event attributes
are however ok, as browsers will deal with entities in attribute
values.

In the future as true XHTML browsers are deployed this problem
will go away, since when parsed as XML, you will be able to
use script elements with &lt; for < or with the use of an
enclosing CDATA marked section.

I am uncertain as what HTML Tidy should do about this problem.
If it wraps the contents of a script element in a CDATA marked
section, it will stop the pages working in existing browsers.
Ditto if it escapes the problem characters. It could write the
contents of the script element to a new file, but what file
name should it use?  One possibility is simply to warn if Tidy
finds < and & within script elements, placing the burden on
the user to decide how to deal with the problem.

What do people think about this?

Regards,

-- Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
tel/fax: +44 122 578 3011 (or 2521) +44 778 532 0444 (mobile)
World Wide Web Consortium (on assignment from HP Labs)

Received on Friday, 1 September 2000 06:21:21 UTC