
Error Behaviour in parsing HTML with JTidy

From: Carlos Ruiz-Capillas <carlos.ruizcapillas@newknow.com>
Date: Thu, 13 Dec 2001 15:53:48 +0100
Message-ID: <FF2B76E6B7DB08469C3E8E96E3F3BC3F8B2125@skywalker.newknow.com>
To: "'html-tidy@w3.org'" <html-tidy@w3.org>
Hi,

I am trying to identify the text that belongs to a script node, using the
4 Aug 2000 version of JTidy.


Consider parsing the HTML below. JTidy does not handle it well.

<html><head></head>
<body bgcolor="#4B4B4B" onLoad="ieFlash();">

<div id="topLayerLeft">
<table valign="top" cellpadding="0" cellspacing="0" border="0" width="324"> 

<script>
var str=tll_links("yes","<img src=\"/cnet_news/template/link_test.gif\"
border=\"0\" alt=\"Test your connection speed\">","Steve Ballmer, CEO,
Microsoft: Ballmer talks up XML,
.Net","t031201_1930","0","324733","%2A","cnet_news","http://video.cnet.com:8
0/cgi-bin/visearch?user=","Analyst test drives Office XP
(3/1/01)","t030101_1330","<img src=\"http://www.cnet.com/i/gl/vid-b.gif\"
width=\"24\" height=\"18\" hspace=\"0\" align=\"top\"
border=\"0\">","loasf","&value=default&which=1&old=yes&hdr=news_vid_hed.gif"
,"http://video.cnet.com:80/cnet_news/template/asxgen.cgi?",
"cpcode=674&asset=http://cnetnews.download.akamai.com/674/","ccstart=2000&cc
stop=302666","Microsoft gaining ground at trial
(3/2/01)","t030201_0830","ccstart=2000&ccstop=475166");document.write(str);
</script>

</table>
</div>

</body>
</html>


The text nodes of the DOM representation (obtained by calling the
org.w3c.dom.Document parseDOM(InputStream in, OutputStream out) method) are:

TEXT NODE: CNET News.com

TEXT NODE: var str=tll_links("yes","   

TEXT NODE: ","Steve Ballmer, CEO, Microsoft: Ballmer talks up XML,
.Net","t031201_1930","0","324733","%2A","cnet_news","http://video.cnet.com:8
0/cgi-bin/visearch?user=","Analyst test drives Office XP
(3/1/01)","t030101_1330","

TEXT NODE:
","loasf","&value=default&which=1&old=yes&hdr=news_vid_hed.gif","http://vide
o.cnet.com:80/cnet_news/template/asxgen.cgi?",
"cpcode=674&asset=http://cnetnews.download.akamai.com/674/","ccstart=2000&cc
stop=302666","Microsoft gaining ground a
t trial
(3/2/01)","t030201_0830","ccstart=2000&ccstop=475166");document.write(str);

If you ask for the parent node name of the last three text nodes, the value
returned is: DIV
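
For reference, the listing above can be produced with a traversal along the
following lines. This is only a minimal sketch, not the exact code I ran: the
class and method names (TextNodeParents, parse, describeTextNodes) are made up
for illustration, and the JDK's own XML parser stands in for JTidy so that the
example runs without extra libraries. With JTidy, the Document would instead
come from the parseDOM(InputStream in, OutputStream out) call mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

class TextNodeParents {

    // Assumption: stand-in for JTidy. Parses well-formed markup with the JDK
    // parser; with JTidy the Document would come from parseDOM(in, out).
    static Document parse(String markup) {
        try {
            byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse markup", e);
        }
    }

    // Returns one "parentName: text" entry per non-blank text node.
    static List<String> describeTextNodes(Node root) {
        List<String> found = new ArrayList<>();
        collect(root, found);
        return found;
    }

    // Depth-first walk over the tree, in document order.
    private static void collect(Node node, List<String> found) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                found.add(node.getParentNode().getNodeName() + ": " + text);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, found);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<html><body><div><script>var str = 1;</script></div></body></html>");
        for (String entry : describeTextNodes(doc)) {
            System.out.println("TEXT NODE: " + entry);
        }
    }
}
```

With a parser that keeps the script element, the entry for the script text
should name the script element as its parent, which is the result I expected
from JTidy instead of DIV.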


Why is the <SCRIPT> tag not recognized?
Is there any way to identify the <SCRIPT> tag, or is there a newer version that
fixes this case?




Thanks,
Charlie.




_____________________________________
Carlos Ruiz-Capillas Zarranz
Software Engineer
Newknow Network S.A.
mailto: capillas@newknow.com
Direct Phone: 91 639 89 50
Main Phone: 91 639 90 00
Fax: 91 638 71 59


Received on Thursday, 13 December 2001 09:54:02 GMT
