Re: link checker and IRIs (+m12n) from Ville Skyttä on 2004-08-28 (public-qa-dev@w3.org from August 2004)

From: Ville Skyttä <ville.skytta@iki.fi>
Date: Sat, 28 Aug 2004 12:16:58 +0300
To: QA-dev <public-qa-dev@w3.org>
Message-Id: <1093684618.32401.444.camel@bobcat.mine.nu>

On Sat, 2004-08-28 at 04:30, Martin Duerst wrote:
> Hello Bjoern,
> 
> At 07:48 04/08/27 +0200, Bjoern Hoehrmann wrote:
> >* Martin Duerst wrote:
> > >I'm planning to work a bit on the link checker in the next few days,
> > >to make it comply with the IRI spec.
> >
> >My understanding is that checklink only supports HTML and XHTML 1.x
> >documents,
> 
> Yes. But I think it would be fairly easy to extend it to parse
> other things, such as SVG,...

I actually have some initial work done already on modularization and
support for more document types in the link checker.  I will post more
details soon, but the basic idea is to provide an event-driven API
(links and fragments/anchors found are reported as "events", akin to
SAX).  I have some crude but at least partially working implementations
supporting XML Base, XLink, XInclude and xml:id, plus some initial work
on links in CSS.  Some bits here and there are of a general nature and
would be (and are) best placed in generic CPAN modules.

The XML things are currently implemented as XML::SAX compliant filters,
so it'll be trivial to plug them into the filter chain of whatever app
is using XML::SAX.
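
To make the "links and anchors as events" idea concrete, here is a
minimal sketch.  It is written in Python's xml.sax rather than Perl's
XML::SAX, and it is not the actual implementation: the class name
LinkEventFilter, the on_event callback and the helper find_links are all
made up for illustration.  It only looks at xlink:href and xml:id, and
ignores XML Base and XInclude entirely.

```python
import io
import xml.sax
from xml.sax.handler import feature_namespaces
from xml.sax.saxutils import XMLFilterBase

XLINK_NS = "http://www.w3.org/1999/xlink"
XML_NS = "http://www.w3.org/XML/1998/namespace"


class LinkEventFilter(XMLFilterBase):
    """SAX filter that reports links and anchors, then passes events on.

    Hypothetical sketch: on_event(kind, value) is an assumed callback
    signature, not part of any existing link checker API.
    """

    def __init__(self, parent, on_event):
        super().__init__(parent)
        self._on_event = on_event

    def startElementNS(self, name, qname, attrs):
        # In namespace mode, attribute keys are (namespace URI, local name).
        for (ns, local), value in attrs.items():
            if ns == XLINK_NS and local == "href":
                self._on_event("link", value)
            elif ns == XML_NS and local == "id":
                self._on_event("anchor", value)
        # Forward the event so further filters in the chain still see it.
        super().startElementNS(name, qname, attrs)


def find_links(xml_text):
    """Parse xml_text and return the (kind, value) events in document order."""
    events = []
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    filt = LinkEventFilter(parser, lambda kind, value: events.append((kind, value)))
    filt.parse(io.BytesIO(xml_text.encode("utf-8")))
    return events
```

Because the filter forwards every event unchanged, it can sit anywhere in
an existing filter chain; the application only has to supply the
callback.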

There's at least one thing, though, that SAX filtering alone (like the
current link checker code) does not seem suited for in the context of
recursive link checking.  If we want to avoid fetching documents
(possibly) multiple times and to support "dynamic" stuff like XPointer
(when used with anything more complex than #xpointer(id('foo'))), that
would AFAICT require us to store the target document, its DOM tree or
something similar for the duration of the recursive run.  I have not
thought about this too much yet, so comments and ideas are very much
welcome.
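
As a starting point for discussion, the "fetch each document only once"
half of this could look roughly like the sketch below (again Python, for
illustration only).  DocumentCache and the fetch callable are
hypothetical names, and the sketch stores just the set of anchor names
per document.  That is enough for plain #foo fragments but not for
anything XPointer-like; for that the cache would presumably have to hold
the parsed tree instead.

```python
from urllib.parse import urldefrag


class DocumentCache:
    """Remember the anchors of each fetched document for one checking run.

    Hypothetical sketch: fetch(url) is an assumed callable that retrieves
    the document at url and returns an iterable of its anchor names.
    """

    def __init__(self, fetch):
        self._fetch = fetch
        self._anchors = {}     # document URL without fragment -> frozenset of anchors
        self.fetch_count = 0   # number of real fetches, for illustration

    def anchors_for(self, url):
        """Return the anchors of the document at url, fetching it at most once."""
        base, _fragment = urldefrag(url)
        if base not in self._anchors:
            self.fetch_count += 1
            self._anchors[base] = frozenset(self._fetch(base))
        return self._anchors[base]

    def check(self, url):
        """True if url has no fragment or its fragment is a known anchor."""
        base, fragment = urldefrag(url)
        anchors = self.anchors_for(base)
        return not fragment or fragment in anchors
```

With this, links to http://example.org/a#x and http://example.org/a#y
cost one fetch between them, since the cache key is the URL with the
fragment stripped.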

Another area that is somewhat dark to me is parsing, for example, XML
Schemas in order to find out what exactly is a link or an anchor in
"unknown" document types, and how that work in the link checker would
relate to the Markup Validator.

> >Is there any chance you could implement whatever you had in mind here
> >as new stand-alone Perl modules, either in the W3C::* namespace or
> >probably even better in the more general CPAN namespaces (HTML::, URI::,
> >etc.)? It seems these would be of a mostly more general nature and
> >likely to be re-used by other tools, that's quite difficult to do with
> >inline code, and checklink is already > 2000 lines of code, we should
> >try to avoid adding significantly more code to it.

+1000

> I was myself quite frightened of the checklink code up to a few days
> ago. I'm now quite a bit less after I have looked through it a few times
> on the bus. [I don't in any way claim I understand it yet.]
> For what I'm planning for the link checker at the moment, I'm not
> sure that will become a module. But it's possible to think about
> how to move that code, or similar code,

I am pretty familiar with it already, but some parts of it never stop
frightening me :)  I think a somewhat thorough rewrite would be a good
idea, but I also think time spent getting familiar with the current
code is not necessarily in vain.

Setting up a Wiki page for documenting the ideas and comments for $TNV
could be a good idea.
Received on Saturday, 28 August 2004 09:17:03 UTC