checklink: erroneously parses scripts? from Sebastian Kuzminsky on 2013-01-21 (www-validator@w3.org from January 2013)

From: Sebastian Kuzminsky <seb@highlab.com>
Date: Mon, 21 Jan 2013 08:08:51 -0700
To: www-validator@w3.org
Message-ID: <50FD5A03.7040308@highlab.com>

Hi folks, thanks for the wonderful checklink service!  I've started 
using it extensively to find broken links in our documentation webpages, 
and it's been very helpful.

However, I have run into what seems like a bug.

Our docs are generated by asciidoc, and asciidoc has a JavaScript-based
mechanism for handling footnotes.  The JavaScript code in our HTML docs
triggers a broken fragment error under checklink 4.81:

http://validator.w3.org/checklink?uri=http%3A%2F%2Fhighlab.com%2F~seb%2Femc2%2Fbad-links%2Fgcode%2Fgcode.html&hide_type=all&depth=&check=Check

checklink reports three broken fragments in that document.  The first
one is a genuine bug in our docs
(http://highlab.com/~seb/emc2/bad-links/gcode/gcode.html/#sec:Path-Control-Mode/
(line 2848)), but I think the other two are errors in checklink.

The lines in the document that checklink is warning about look like
this, deep inside a <script type="text/javascript"> element:

footnotes: function () {
   var cont = document.getElementById("content");
   var noteholder = document.getElementById("footnotes");
   var spans = cont.getElementsByTagName("span");
   var refs = {};
   var n = 0;
   for (i=0; i<spans.length; i++) {
     if (spans[i].className == "footnote") {
       n++;
       // Use [\s\S] in place of . so multi-line matches work.
       // Because JavaScript has no s (dotall) regex flag.
       note = spans[i].innerHTML.match(/\s*\[([\s\S]*)]\s*/)[1];
       noteholder.innerHTML +=
         "<div class='footnote' id='_footnote_" + n + "'>" +
         "<a href='#_footnoteref_" + n + "' title='Return to text'>" +
         n + "</a>. " + note + "</div>";
       spans[i].innerHTML =
         "[<a id='_footnoteref_" + n + "' href='#_footnote_" + n +
         "' title='View footnote' class='footnote'>" + n + "</a>]";
       var id =spans[i].getAttribute("id");
       if (id != null) refs["#"+id] = n;
     }
   }
   if (n == 0)
     noteholder.parentNode.removeChild(noteholder);
   else {
     // Process footnoterefs.
     for (i=0; i<spans.length; i++) {
       if (spans[i].className == "footnoteref") {
         var href = spans[i].getElementsByTagName("a")[0].getAttribute("href");
         href = href.match(/#.*/)[0];  // Because IE return full URL.
         n = refs[href];
         spans[i].innerHTML =
           "[<a href='#_footnote_" + n +
           "' title='View footnote' class='footnote'>" + n + "</a>]";
       }
     }
   }
}

It seems to me that checklink is incorrectly parsing the contents of
this script element as HTML, and is treating the string literals in
this function as HTML fragment ids.

Does that seem right?
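
To illustrate the suspected failure mode, here is a minimal sketch.  It
is not checklink's actual code (checklink is a Perl program, and I have
not looked at how it tokenizes pages); the names sample, fragmentLinks
and stripScripts are made up for this example.  It only shows that
scanning the raw page text for href attributes picks up the string
literal inside the script, while skipping script contents first does
not:

```javascript
// Hypothetical illustration only; not taken from checklink.
// A tiny page with one real fragment link and one href that exists
// only inside a JavaScript string literal.
const sample = `<html><body>
<a href="#real">real link</a>
<span id="real">target</span>
<script type="text/javascript">
var s = "<a href='#_footnote_" + n + "'>" + n + "</a>";
</script>
</body></html>`;

// Collect every href value that starts with "#", scanning the text as given.
function fragmentLinks(html) {
  const re = /href\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  const found = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    const value = m[1] !== undefined ? m[1] : m[2];
    if (value.startsWith("#")) found.push(value);
  }
  return found;
}

// Drop script elements, since their contents are script text, not markup.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, "");
}

console.log(fragmentLinks(sample));               // scan of the raw text
console.log(fragmentLinks(stripScripts(sample))); // scan with scripts skipped
```

The first call also returns the piece of the string literal between
the single quotes, which is not a link in the rendered page; the second
call returns only the real link.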


-- 
Sebastian Kuzminsky

Received on Monday, 21 January 2013 15:10:25 UTC