Re: link checker and IRIs (+m12n) from Bjoern Hoehrmann on 2004-08-28 (public-qa-dev@w3.org from August 2004)

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 28 Aug 2004 16:16:37 +0200
To: Ville Skyttä <ville.skytta@iki.fi>
Cc: QA-dev <public-qa-dev@w3.org>
Message-ID: <4136797e.215936320@smtp.bjoern.hoehrmann.de>
* Ville Skyttä wrote:
>I have actually some initial work already done wrt modularization and
>support for more document types for the link checker.  I will post more
>details soon, but my basic idea is to provide an event driven API (found
>links and fragments/anchors are reported as "events", akin to SAX),

It might in fact also make sense to report broken links through such an
API. We need something that represents an instance of a "problem" within
a "resource". It would have something that "identifies" the problem, a
"location" that relates to the problem and arguments relevant to it. For
each problem there would be additional meta data that among other things
describes the problem and its properties in more detail. For example, in
an XHTML 1.0 text/html document

  <p xml:id = 'foo' />

One problem would be that there is no xml:id attribute for the p element
in the document type,

  108 -- OpenSP message ID, the "identifier"
  x,y -- the "location"
  foo -- the argument

meta data then could say that

  * this is a violation of a "DTD constraint"
  * the location is the first character of the attribute name
  * the first argument is the name of an attribute
  * the text for the message is "there is no attribute %1"
  * the verbose text is "You have used the attribute named above..."
  * it is best to highlight the entire attribute specification to
    help users spot the error
  * the highlighting should include the entire start-tag
  * the constraint is spelled out in http://example.org/spec#sec152
  * ...

and additional meta data could say that for example

  * violations of DTD constraints are "errors"
  * a resource that has errors is "invalid"
  * ...
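A minimal sketch of how the two layers of meta data above could be kept
apart from the problem report itself. Everything here is an assumption
made for illustration: the hash layout, the function names, the line and
column values, and the choice of "xml:id" as the argument (taken because
the meta data describes the first argument as an attribute name).

```perl
use strict;
use warnings;

# Hypothetical meta data table, keyed by the problem identifier.
my %PROBLEM_META = (
    108 => {
        class => 'DTD constraint',
        text  => 'there is no attribute %1',
        spec  => 'http://example.org/spec#sec152',
    },
);

# Hypothetical second layer: how each class of constraint is rated.
my %CLASS_SEVERITY = ( 'DTD constraint' => 'error' );

# Fill %1, %2, ... in the message text with the problem's arguments.
sub message_for {
    my ($problem) = @_;
    my $meta = $PROBLEM_META{ $problem->{id} } or return;
    my @args = @{ $problem->{args} || [] };
    my $text = $meta->{text};
    $text =~ s{%([1-9]\d*)}{ defined $args[$1 - 1] ? $args[$1 - 1] : "%$1" }ge;
    return $text;
}

# Derive the severity from the class of constraint that was violated.
sub severity_of {
    my ($problem) = @_;
    my $meta = $PROBLEM_META{ $problem->{id} } or return 'unknown';
    return $CLASS_SEVERITY{ $meta->{class} } || 'unknown';
}

# A resource that has errors is "invalid".
sub resource_status {
    my (@problems) = @_;
    my $errors = grep { severity_of($_) eq 'error' } @problems;
    return $errors ? 'invalid' : 'no errors found';
}

# The checker itself only reports identifier, location and arguments.
my $problem = { id => 108, line => 1, column => 4, args => ['xml:id'] };
print message_for($problem), ' [', severity_of($problem), "]\n";
print resource_status($problem), "\n";
```

The checker never sees the message text or the severity; both could be
swapped out without touching the code that detects the problem.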

For example, my experimental Appendix C Validator does something like

  # check for <p lang="de">...</p>
  if (exists $att{'lang'} && not exists $att{'xml:lang'})
  {
    report_problem($exp, APPC_ONLY_LANG, $end);
  }

and has a table

  ...
  APPC_ONLY_LANG, [ 7, APPC_ERRO, "<example lang='en'> shall be ..."],
  ...

where 7 is the section in the XHTML 1.0 SE Recommendation where the
constraint is expressed, APPC_ERRO is an "error", and the text is the
message text it currently uses. The location information is given
through the $exp XML::Parser::Expat object (you can call current_line
etc. on it). For some "problems" this is more sophisticated: e.g. if
the document uses &apos;, the location identifies the start position of
the entity reference, whereas for APPC_ONLY_LANG only the element start
position is reported.

The point here is that checkers should report as little as possible and
everything else should come from different sources. E.g. we want
pluggable message text to enable localization, hence the checker should
not hardcode the message text in the source code.
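As a rough sketch of what pluggable message text could look like, the
checker would report only the identifier and a separate catalogue would
supply the text per language. The catalogue layout, the function name,
the fallback order and both message strings below are invented for
illustration; they are not the texts the Appendix C Validator uses.

```perl
use strict;
use warnings;

# Hypothetical message catalogues, one per language. In practice these
# would be loaded from files maintained by translators.
my %CATALOG = (
    en => { APPC_ONLY_LANG => 'lang is used without xml:lang' },
    de => { APPC_ONLY_LANG => 'lang wird ohne xml:lang verwendet' },
);

# Look up the text for a problem identifier in the requested language,
# falling back to English and finally to the bare identifier.
sub message_text {
    my ($lang, $id) = @_;
    for my $candidate ($lang, 'en') {
        my $catalog = $CATALOG{$candidate} or next;
        return $catalog->{$id} if defined $catalog->{$id};
    }
    return $id;
}

print message_text('de', 'APPC_ONLY_LANG'), "\n";
print message_text('fr', 'APPC_ONLY_LANG'), "\n";   # no French catalogue
```

Adding a language then means adding a catalogue, not editing a checker.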

Harmonization in this regard also allows for more interoperation among
the various services: we could have serializations in XML/RDF/EARL/N3/
XHTML/Perl/... and create a meta-service that just presents the results
of the sub-services. In fact, the forthcoming SAX-based architecture
of the Markup Validator will be something quite like that. In
pseudo-code:

  my $m = XML::SAX::Machines::Pipeline(
    'SGML::Parser::OpenSP::SAX',
    'XML::SAX::Filter::VerboseLocations',
    'W3C::Markup::XHTML::AttributeChecks',  # href="ö" not allowed
    'W3C::Markup::XHTML::NestingChecks',    # <a><em><a>... not allowed
    'W3C::Markup::XHTML::IntegrityChecks',  # Content-Style-Type etc.
    'W3C::Markup::XHTML::AppendixCChecks',  # <p/>, <br></br>, etc.
    'XML::SAX::Filter::XHTML::Outline',
    'XML::SAX::Filter::XHTML::ParseTree',
    ...);

  $m->parse_file(...);

(It can't work exactly like that, but you should get the idea.) In order
to present the results from ::SAX, ::AttributeChecks, ::NestingChecks,
::IntegrityChecks, and ::AppendixCChecks, the Validator would require
that the "problems" these filters report are in some common format so
that it does not have to implement custom code to process the output
from each reporter individually. With such harmonized problem reports
in place, it would be easy to add something that includes the output
from the link checker in the results. Or other services, like

  * http://www.w3.org/2003/09/nschecker
  * http://www.w3.org/2004/07/references-checker-ui
  * http://www.w3.org/2002/01/spellchecker
  * http://www.w3.org/2001/07/pubrules-form
  * ...
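To illustrate why a common format matters, here is a small sketch of a
meta-service that merges the reports of several sub-services without
any per-service code. The checker names, the stubbed results, the
LINK_BROKEN identifier and the hash keys are all hypothetical; real
sub-services would of course inspect the resource they are given.

```perl
use strict;
use warnings;

# Hypothetical sub-services. Each one returns a list of problems in the
# same shape: identifier, location (line/column) and arguments.
my %CHECKERS = (
    'appendix-c' => sub {
        return ( { id => 'APPC_ONLY_LANG', line => 12, column => 3, args => [] } );
    },
    'link-checker' => sub {
        return ( { id => 'LINK_BROKEN', line => 4, column => 10,
                   args => ['http://example.org/missing'] } );
    },
);

# Run every checker on the resource, tag each problem with its source,
# and return all problems ordered by location in the document.
sub collect_problems {
    my ($uri) = @_;
    my @all;
    for my $name (sort keys %CHECKERS) {
        for my $problem ($CHECKERS{$name}->($uri)) {
            push @all, { %$problem, source => $name };
        }
    }
    return sort {
        $a->{line} <=> $b->{line} or $a->{column} <=> $b->{column}
    } @all;
}

for my $p (collect_problems('http://example.org/')) {
    print "$p->{line}:$p->{column} $p->{source} $p->{id}\n";
}
```

Adding another service to the results is then a matter of adding one
entry to the table.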

>There's at least one thing though that SAX filtering alone does not seem
>to be suitable for in the context of recursive link checking (nor does
>the current link checker code).  If we want to avoid fetching documents
>(possibly) multiple times and support "dynamic" stuff like XPointer
>(when used with anything more complex than #xpointer(id('foo'))), that
>would AFAICT require us to store the target document or its DOM tree or
>something for the duration of the recursive run.

Some of the things I am "working" on do indeed require that the source
code of the original document is available. For example, the Appendix C
Validator needs to know whether a document uses &apos; or something else
to get a '; this is not possible using SAX alone, as such information is
lost (but you can get there with proper location information). That
should not be a problem, though it seems that is not what you had in
mind here. I am not sure why fetching documents multiple times could not
be avoided; all you need is a cache of the URIs that were fetched
already. For XPointer there is indeed a problem: I am not aware of
any XPointer implementation that works on a SAX event stream, so you
would rather need a DOM (or even custom-DOM) tree to check whether such
a link is broken or not.
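The cache could be as simple as the following sketch. It assumes that
links differing only in their fragment identifier point into the same
document, so the fragment is stripped before the lookup. The function
names are made up, and fetch_document is only a stand-in for a real
HTTP request.

```perl
use strict;
use warnings;

my %cache;             # document URI (without fragment) => content
my $fetch_count = 0;   # only here to show how often we really fetch

# Stand-in for a real HTTP fetch; counts how often it is called.
sub fetch_document {
    my ($uri) = @_;
    $fetch_count++;
    return "<content of $uri>";
}

# Return the document a link points into, fetching it at most once.
sub get_document {
    my ($uri) = @_;
    (my $doc_uri = $uri) =~ s/#.*\z//s;    # drop the fragment identifier
    $cache{$doc_uri} = fetch_document($doc_uri)
        unless exists $cache{$doc_uri};
    return $cache{$doc_uri};
}

get_document('http://example.org/a#foo');
get_document('http://example.org/a#bar');   # served from the cache
get_document('http://example.org/b');
print "fetches: $fetch_count\n";            # prints "fetches: 2"
```

For the XPointer case the same table could hold a parsed tree instead
of the raw content, at the cost of keeping it around for the whole run.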

>Another thing that is somewhat a dark area to me is parsing for example
>XML Schemas in order to find out what exactly is a link or an anchor for
>"unknown" document types, and what is the relation of this in the link
>checker wrt the Markup Validator.

XML Schema does not allow schemas to say what is considered a link and
what is not; you can only say that the lexical space, or rather the
value space, for something is xsd:anyURI, and that might well be
something that is not a "link" (you know, this locators-as-identifiers
thing).
There is a not-very-active linking task force of the Hypertext CG that
seeks to find solutions to this "problem"...

>Setting up a Wiki page for documenting the ideas and comments for $TNV
>could be a good idea.

http://esw.w3.org/topic/MarkupValidator
http://esw.w3.org/topic/CssValidator
http://esw.w3.org/topic/LinkChecker

Feel free to create the last one; the other two exist already.
Received on Saturday, 28 August 2004 14:17:20 UTC