BUG parsing HTML

From: Tabieriu Tatar <ttatar@commercehub.com>
Date: Tue, 8 Jan 2002 13:45:29 -0500 (EST)
Message-ID: <3C3B3DD1.AE530CA0@commercehub.com>
To: html-tidy@w3.org
I encountered the following issues:

<HTML>
<BODY>
 <font size="2" face="Verdana"><a href="testpage.html"><img
src="../../rightarrow.jpg"></a></font>
</BODY>
</HTML>

In the HTML above I can find the IMG tag I want in the DOM (the tag whose
src attribute contains the string "rightarrow.jpg"). When I try to get its
parent, I receive the FONT tag instead of the anchor tag. Coming from the
other direction, I can find the anchor tag on this page, but when I ask for
the children of the anchor tag I receive null.

The code I used for searching for the IMG tag is the following:

   Document doc = response.getDOM();

   String tagName = "img";
   NodeList nList = doc.getElementsByTagName(tagName);
   int listLength = nList.getLength();
   for ( int i = 0; i < listLength; i++ ) {
        Node imgNode = nList.item(i);
        NamedNodeMap allAttributes = imgNode.getAttributes();
        Node srcNode = allAttributes.getNamedItem("src");
        String srcNodeValue = srcNode.getNodeValue();
        // IT PRINTS: srcNodeValue =../../rightarrow.jpg
        System.out.println("\nsrcNodeValue =" + srcNodeValue);

        String rightArrow = "rightarrow.jpg";
        int index = srcNodeValue.indexOf(rightArrow);
        if ( index >= 0 ) {   // >= 0 so that a match at position 0 also counts
             String currentNodeName = imgNode.getNodeName();
             // IT PRINTS: current node name is : img
             System.out.println("\n current node name is : " + currentNodeName);

             Node parent = imgNode.getParentNode();
             String parentNodeName = parent.getNodeName();
             // IT PRINTS: parentNodeName = font INSTEAD OF "a" (ANCHOR TAG)
             System.out.println("\nparentNodeName = " + parentNodeName);

             NodeList children = parent.getChildNodes();
             int childrenLength = children.getLength();
             for ( int j = 0; j < childrenLength; j++ ) {
                  Node childNode = children.item(j);
                  String childNodeName = childNode.getNodeName();
                  // IT PRINTS: children of current tag : img
                  System.out.println("\n children of current tag : " + childNodeName);
             }
        }
   }

The code I used for searching for the anchor tag is the following:

   Document doc = response.getDOM();
   String tagName = "a";
   NodeList nList = doc.getElementsByTagName(tagName);
   int listLength = nList.getLength();
   // IT PRINTS: listLength = 1
   System.out.println("\n listLength = " + listLength);
   for ( int i = 0; i < listLength; i++ ) {
        Node anchorNode = nList.item(i);
        NamedNodeMap allAtts = anchorNode.getAttributes();
        Node hrefNode = allAtts.getNamedItem("href");

        String hrefNodeValue = hrefNode.getNodeValue();
        int index = hrefNodeValue.indexOf("testpage.html");
        // IT PRINTS: index = 0
        System.out.println("\nindex = " + index);
        if ( index >= 0 ) {
             // IT PRINTS: hrefNodeValue = testpage.html
             System.out.println("\n hrefNodeValue = " + hrefNodeValue);
             // Note: getChildNodes() is called here on hrefNode (the href
             // attribute node), not on anchorNode (the element itself).
             NodeList children = hrefNode.getChildNodes();
             // IT PRINTS: children = null INSTEAD OF A NodeList WHICH
             // WOULD CONTAIN AN IMG TAG
             System.out.println("\nchildren = " + children);
             int childrenLength = children.getLength();
             System.out.println("\nchildrenLength = " + childrenLength);

             for ( int j = 0; j < childrenLength; j++ ) {
                  Node childNode = children.item(j);
                  String childNodeName = childNode.getNodeName();
                  System.out.println("\n children of Anchor tag : " + childNodeName);
             }
        }
   }

WHEN EACH TAG IN THE HTML SOURCE IS ON ITS OWN LINE, EVERYTHING WORKS
CORRECTLY (the parent of the IMG tag is the anchor, and the child of the
anchor is the IMG tag):

<HTML>
<BODY>
     <font size="2" face="Verdana">
          <a href="testpage.html">
               <img src="../../rightarrow.jpg">
          </a>
     </font>
</BODY>
</HTML>
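
For comparison, the parent/child relationship expected from the one-line markup can be sketched with the XML parser that ships with the JDK. This is only an illustration under stated assumptions: it does not use Tidy or response.getDOM(), so it does not reproduce the behaviour reported above; the markup has been made well-formed XML (the img is self-closed) while keeping the line break inside the img tag; and the class and helper names (AnchorChildrenSketch, parentNameOfFirst, firstChildNameOfFirst) are made up for this example. Note that the children are requested from the anchor element itself, not from its href attribute node.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class AnchorChildrenSketch {
    // Well-formed XML version of the one-line markup (assumption: img is
    // self-closed so that a plain XML parser accepts it). The line break
    // inside the img tag is kept on purpose.
    static final String MARKUP =
        "<html><body><font size=\"2\" face=\"Verdana\">"
        + "<a href=\"testpage.html\"><img\nsrc=\"../../rightarrow.jpg\"/></a>"
        + "</font></body></html>";

    // Parse a string with the JDK's built-in DOM parser (not Tidy).
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    // Name of the parent of the first element with the given tag name.
    static String parentNameOfFirst(Document doc, String tagName) {
        Node node = doc.getElementsByTagName(tagName).item(0);
        return node.getParentNode().getNodeName();
    }

    // Name of the first child of the first element with the given tag name,
    // or null if that element has no children.
    static String firstChildNameOfFirst(Document doc, String tagName) {
        Node node = doc.getElementsByTagName(tagName).item(0);
        NodeList children = node.getChildNodes();   // called on the element
        return children.getLength() > 0 ? children.item(0).getNodeName() : null;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(MARKUP);
        System.out.println("parent of img = " + parentNameOfFirst(doc, "img"));
        System.out.println("first child of a = " + firstChildNameOfFirst(doc, "a"));
    }
}
```

With a conforming DOM, the img element's parent is the a element and the a element's first child is the img element, regardless of where line breaks fall inside a tag.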
Received on Wednesday, 9 January 2002 17:06:59 GMT
