From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Sat, 28 Aug 2004 16:16:37 +0200
To: Ville Skyttä <ville.skytta@iki.fi>
Cc: QA-dev <public-qa-dev@w3.org>
* Ville Skyttä wrote:

>I have actually some initial work already done wrt modularization and
>support for more document types for the link checker. I will post more
>details soon, but my basic idea is to provide an event driven API (found
>links and fragments/anchors are reported as "events", akin to SAX),

It might in fact also make sense to report broken links through such an
API. We need something that represents an instance of a "problem" within
a "resource". It would have something that "identifies" the problem, a
"location" that relates to the problem, and arguments relevant to it.
For each problem there would be additional meta data that, among other
things, describes the problem and its properties in more detail. For
example, in an XHTML 1.0 text/html document

  <p xml:id = 'foo' />

one problem would be that there is no xml:id attribute for the p
element in the document type:

  108  -- OpenSP message ID, the "identifier"
  x,y  -- the "location"
  foo  -- the argument

The meta data could then say that

  * this is a violation of a "DTD constraint"
  * the location is the first character of the attribute name
  * the first argument is the name of an attribute
  * the text for the message is "there is no attribute %1"
  * the verbose text is "You have used the attribute named above..."
  * it is best to highlight the entire attribute specification to help
    users spot the error
  * the highlighting should include the entire start-tag
  * the constraint is spelled out in http://example.org/spec#sec152
  * ...

and additional meta data could say that, for example,

  * violations of DTD constraints are "errors"
  * a resource that has errors is "invalid"
  * ...
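To make that a bit more concrete, here is a rough sketch in Perl of
what such a problem instance and its meta data table might look like;
this is only an illustration, all field names and the reporting code
below are invented, not an actual API:

  # A checker reports only an identifier, a location and arguments;
  # everything else comes from a separate meta data table.
  my $problem = {
      identifier => 108,                           # OpenSP message ID
      location   => { line => 23, column => 42 },  # where it occurred
      arguments  => [ 'xml:id' ],                  # offending attribute
  };

  # Meta data keyed by identifier; localizable message text lives
  # here, not in the checker.
  my %meta = (
      108 => {
          class    => 'DTD constraint',
          severity => 'error',
          text     => 'there is no attribute %1',
          spec     => 'http://example.org/spec#sec152',
      },
  );

  # A generic reporter then needs no checker-specific code:
  my $info = $meta{ $problem->{identifier} };
  (my $text = $info->{text}) =~ s/%(\d+)/$problem->{arguments}[$1-1]/ge;
  printf "%s: %s (line %d, column %d), see %s\n",
      $info->{severity}, $text,
      $problem->{location}{line}, $problem->{location}{column},
      $info->{spec};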
For example, my experimental Appendix C Validator does something like

  # check for <p lang="de">...</p>
  if (exists $att{'lang'} && not exists $att{'xml:lang'}) {
      report_problem($exp, APPC_ONLY_LANG, $end);
  }

and has a table

  ...
  APPC_ONLY_LANG, [ 7, APPC_ERRO, "<example lang='en'> shall be ..." ],
  ...

where 7 is the section in the XHTML 1.0 SE Recommendation where the
constraint is expressed, APPC_ERRO is an "error", and the text is the
message text it currently uses. The location information is given
through the $exp XML::Parser::Expat object (you can call current_line
etc. on it). For some "problems" this is more sophisticated, e.g. if
the document uses &apos;, the location identifies the start position of
the entity reference; for APPC_ONLY_LANG only the element start
position is reported.

The point here is that checkers should report as little as possible and
everything else comes from different sources, e.g. we want pluggable
message text to enable localization, hence the checker should not
hardcode the message text in the source code.

Harmonization in this regard also allows for more interoperation among
the various services; we could have serializations in XML/RDF/EARL/N3/
XHTML/Perl/... and create a meta-service that just presents the results
of the sub-services. In fact, the forthcoming SAX based architecture of
the Markup Validator will be something quite like that. In pseudo-code,

  my $m = XML::SAX::Machines::Pipeline(
    'SGML::Parser::OpenSP::SAX',
    'XML::SAX::Filter::VerboseLocations',
    'W3C::Markup::XHTML::AttributeChecks', # href="ö" not allowed
    'W3C::Markup::XHTML::NestingChecks',   # <a><em><a>... not allowed
    'W3C::Markup::XHTML::IntegrityChecks', # Content-Style-Type etc.
    'W3C::Markup::XHTML::AppendixCChecks', # <p/>, <br></br>, etc.
    'XML::SAX::Filter::XHTML::Outline',
    'XML::SAX::Filter::XHTML::ParseTree',
    ...
  );
  $m->parse_file(...);

(it can't work exactly like that, but you should get the idea). In
order to present the results from ::SAX, ::AttributeChecks,
::NestingChecks, ::IntegrityChecks, and ::AppendixCChecks, the
Validator would require that the "problems" these filters report are
in some common format, so that it does not have to implement custom
code to process the output from each reporter individually.

With such harmonized problem reports in place, it would be easy to add
something that includes the output from the link checker in the
results. Or other services, like

  * http://www.w3.org/2003/09/nschecker
  * http://www.w3.org/2004/07/references-checker-ui
  * http://www.w3.org/2002/01/spellchecker
  * http://www.w3.org/2001/07/pubrules-form
  * ...

>There's at least one thing though that SAX filtering alone does not seem
>to be suitable for in the context of recursive link checking (nor does
>the current link checker code). If we want to avoid fetching documents
>(possibly) multiple times and support "dynamic" stuff like XPointer
>(when used with anything more complex than #xpointer(id('foo'))), that
>would AFAICT require us to store the target document or its DOM tree or
>something for the duration of the recursive run.

Some of the things I am "working" on do indeed require that the source
code of the original document is available; for example, the Appendix C
Validator needs to know whether a document uses &apos; to get a ' or
something else, which is not possible using SAX alone as such
information is lost (but you can get there with proper location
information). That should not be a problem.

Though it seems that is not what you had in mind here, I do not see why
fetching documents multiple times cannot be avoided: all you need is a
cache of the URIs that have already been fetched.
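Roughly, and only as a sketch: extract_links() below is a made-up
stand-in for whatever finds the links in a response (HTML::LinkExtor,
say), and the rest uses plain LWP:

  use LWP::UserAgent;

  my %fetched;   # URI => HTTP::Response, filled on first fetch
  my $ua = LWP::UserAgent->new;

  # Return the cached response if we have one, fetch otherwise.
  sub fetch_once {
      my ($uri) = @_;
      $fetched{$uri} = $ua->get($uri) unless exists $fetched{$uri};
      return $fetched{$uri};
  }

  # Recurse over links; the cache guarantees one fetch per URI even
  # if many documents link to it.
  sub check_recursively {
      my ($uri, $depth) = @_;
      return if $depth <= 0 or exists $fetched{$uri};  # already done
      my $response = fetch_once($uri);
      return unless $response->is_success;
      for my $link (extract_links($response)) {        # made-up helper
          check_recursively($link, $depth - 1);
      }
  }

  # e.g. check_recursively('http://example.org/', 2);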
For XPointer there is indeed a problem: I am not aware of any XPointer
implementation that works on a SAX event stream; you would rather need
a DOM (or even custom-DOM) tree to check whether such a link is broken
or not.

>Another thing that is somewhat a dark area to me is parsing for example
>XML Schemas in order to find out what exactly is a link or an anchor for
>"unknown" document types, and what is the relation of this in the link
>checker wrt the Markup Validator.

XML Schema does not allow schemas to say what is considered a link and
what is not; you can only say that the lexical space, or rather the
value space, for something is xsd:anyURI, and that might well be
something that is not a "link" (you know, this locators-as-identifiers
thing). There is a not-very-active linking task force of the Hypertext
CG that seeks to find solutions to this "problem"...

>Setting up a Wiki page for documenting the ideas and comments for $TNV
>could be a good idea.

  http://esw.w3.org/topic/MarkupValidator
  http://esw.w3.org/topic/CssValidator
  http://esw.w3.org/topic/LinkChecker

Feel free to create the latter; the other ones exist already.

Received on Saturday, 28 August 2004 14:17:20 UTC