RE: xml parsing error for < character from Evan Lenz on 2000-09-06 (html-tidy@w3.org from July to September 2000)

From: Evan Lenz <elenz@xyzfind.com>
Date: Wed, 6 Sep 2000 11:53:02 -0700
To: <html-tidy@w3.org>
Message-ID: <BNEMICIEADHDDOIKLHNCMEOKCDAA.elenz@xyzfind.com>
After further review, I dislike option #3.  The W3C advises against
continuing the practice of "hiding" scripts and stylesheets within comments.
Plus, even though I'm using a comment-aware XML parser, not everyone will
be.  For XHTML, I think Tidy's best option would be to strip any open and
close comment delimiters when they are right after the opening tag and
before the closing tag, respectively, and then replacing them with CDATA
section delimiters.  Tidy can then issue a warning that the script may not
work in current browsers and that perhaps the user should include the script
from another file.

Let's vote! :)

Evan Lenz
elenz@xyzfind.com
http://www.xyzfind.com
XYZFind Corp. "Building Better Search"


-----Original Message-----
From: html-tidy-request@w3.org [mailto:html-tidy-request@w3.org]On
Behalf Of Evan Lenz
Sent: Wednesday, September 06, 2000 11:20 AM
To: html-tidy@w3.org
Subject: RE: xml parsing error for < character



I, for one, am in favor of at least having an option for either 1) nesting
script content into CDATA sections, 2) escaping < to &lt; and & to &amp;, 3)
always nesting script contents in a comment and escaping --.  In short, I
don't care how it's done, but well-formedness is well-formedness.  One of
the primary advantages of (and reasons for) XHTML is the ability to do
further XML processing with languages such as XSLT.  Perhaps my situation is
unique, but I don't need Tidy's output to display correctly in a browser,
because it's just serving as input to my XSLT processor.  The XSLT processor
can then worry about the HTML output method, allowing the result to work in
current browsers.  So, ultimately, that isn't much of a solution for those
that want XHTML that will work in browsers as their final result, and, yes,
it's a cop-out to the real XHTML problem.  But my opinion is that if Tidy is
to advertise the ability to output XHTML (or any kind of well-formed XML),
it should do it (or at least try its best).

As for instances of the literal string ]]> in script content, that's easy.
Simply escape it with the following string: ]]]]><![CDATA[>

Okay, I didn't come up with that, but it certainly works (by breaking the
CDATA section into two).

Thanks for all the great work you've done so far!!

Evan Lenz
elenz@xyzfind.com
http://www.xyzfind.com
XYZFind Corp. "Building Better Search"


-----Original Message-----
From: html-tidy-request@w3.org [mailto:html-tidy-request@w3.org]On
Behalf Of Dave Raggett
Sent: Friday, September 01, 2000 3:21 AM
To: cateen
Cc: html-tidy@w3.org; gerald@w3.org
Subject: Re: xml parsing error for < character


On Tue, 22 Aug 2000, cateen wrote:

> Having used JTidy to produce an xml file, I then use a SAX based
> XML parser on it and get the following exceptions: 'Whitespace
> required before attributes.' caused by the following line in a
> javascript function:  for (i=0;i<y;i++) OR 'The content begining
> "<" is not legal markup.  Perhaps the "" (&#20;) character
> should be a letter.' caused by :  for ( i = 0; i < y; i++)
>
> Can anyone tell me why this is happening and if there is a
> configuration setting that will change the way the XML parser
> treats the '<' sign.  Alternatively is there a way that I can
> tell JTidy to simply strip out all the information between the
> <script> and </script> tags as it writes the xml file??

When I looked into this I discovered a bug in tidy.c lines 1075
and 1093 which should be testing XmlOut and not XmlTags. This
fix now causes < and & in script element content to be escaped.

This however doesn't solve the problem for XHTML. JavaScript uses
< and & both of which need to be escaped in XML content unless you
wrap the content in a CDATA marked section.

Unfortunately, current HTML browsers don't recognize CDATA marked
sections, and furthermore they expect script elements to have
CDATA content, and hence expect < and & to be unescaped.

The XHTML 1.0 standard therefore recommends you move your scripts
to external files. The onmouseover and other event attributes
are however ok, as browsers will deal with entities in attribute
values.

In the future as true XHTML browsers are deployed this problem
will go away, since when parsed as XML, you will be able to
use script elements with &lt; for < or with the use of an
enclosing CDATA marked section.

I am uncertain as what HTML Tidy should do about this problem.
If it wraps the contents of a script element in a CDATA marked
section, it will stop the pages working in existing browsers.
Ditto if it escapes the problem characters. It could write the
contents of the script element to a new file, but what file
name should it use?  One possibility is simply to warn if Tidy
finds < and & within script elements, placing the burden on
the user to decide how to deal with the problem.

What do people think about this?

Regards,

-- Dave Raggett <dsr@w3.org> http://www.w3.org/People/Raggett
tel/fax: +44 122 578 3011 (or 2521) +44 778 532 0444 (mobile)
World Wide Web Consortium (on assignment from HP Labs)
Received on Wednesday, 6 September 2000 14:50:32 UTC