- From: <bugzilla@farnsworth.w3.org>
- Date: Thu, 08 May 2008 17:22:05 +0000
- To: public-sml@w3.org
- CC:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=5562

cmsmcq@w3.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

------- Comment #3 from cmsmcq@w3.org  2008-05-08 17:22 -------
The SML WG discussed this issue at some length at our face to face meeting 31 March - 2 April; I have been asked to summarize our discussions and expected resolution of the issue.

The initial issue description comes from Henry Thompson's comment #5 on bug 5513 (http://www.w3.org/Bugs/Public/show_bug.cgi?id=5513#c5):

    The SML spec. itself should define ... an XHTML href Reference Scheme .... Either it's easy to do this, so you definitely should, or it's hard, in which case that uncovers a weakness in your spec.

Several questions are intertwined here, which it may be useful to try to distinguish as far as possible.

Q1. Is it desirable that SML be applicable to legacy data (i.e. to document vocabularies not designed with SML in mind)? Is the idea of applying SML to XHTML in itself absurd?

Q2. Is it in fact possible to specify a reference scheme that would work with XHTML?

Q3. If it is possible, or to the extent that it is possible, is it desirable that the SML WG should define such a scheme?

In principle, as Sandy Gao pointed out in http://lists.w3.org/Archives/Public/public-sml/2008Feb/0271.html, it is desirable for SML to be applicable to legacy data, with minimal or no change to the data. So it seems to me not unreasonable to ask whether SML could be applied to XHTML, for example as a quick and simple way to build a link checker. (It is true that not everyone in the SML WG agreed with me on that point, but eventually the WG did agree to consider whether defining such a scheme would be technically feasible.)

We spent most of an afternoon thinking about what would be entailed in specifying a reference scheme for XHTML and whether it's possible at all.
As a way of making the topic more concrete, we asked (following HT's lead in http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2008Mar/0002.html) what it would take to make at least a rudimentary link checker for XHTML documents, using SML technology.

The short answer is: yes, it's possible, within limits and with certain ancillary assumptions, to define a reference scheme for XHTML. But the limits are severe enough, and the ancillary assumptions problematic enough, that in practice such a link checker is unlikely to be of wide interest.

Some of the technical points we arrived at may be instructive, especially the points where our whiteboard design exhibited shortcomings and limitations.

1) SML assumes by design (a) that not every hyperlink is an SML reference to be validated, and (b) that SML references point to elements within the model.

References to documents outside the model will not, in normal practice, be SML references (and if they are, they will be unresolved -- by design, SML references resolve to elements *within the model*). HTML similarly assumes that the target of every reference is part of the Web.

SML's design assumption has some disappointing implications for the design of any SML-based link checker for XHTML: either the link checker is likely to check too few links, or too many, depending on how the SML implementation chooses to manage its knowledge of what documents are in the model.

Some implementations manage the model by keeping an explicit list of the documents in the model; any link going to a resource outside that list won't be checked. An SML-based link checker along these lines seems unlikely to be of much interest unless (for example) the model contains a substantial part of an organization's web site, and the SML-based link checker is expected to check only links to other resources on the same site. If the W3C's link checker is any guide, most users of link checkers would find this restriction to local links off-putting.
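
To make the "explicit list" case concrete, here is a rough sketch (not taken from any SML implementation) of how such a checker would sort the links of one document. The function name classify_links, the dictionary-based model, and the status strings are all assumptions made for illustration only.

```python
from urllib.parse import urljoin, urldefrag
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def classify_links(model, doc_uri):
    """Sort the xhtml:a links of one document by whether they stay in the model.

    `model` is a hypothetical stand-in for an explicit model list:
    a dict mapping absolute document URIs to XHTML source text.
    """
    root = ET.fromstring(model[doc_uri])
    results = []
    for a in root.iter(f"{{{XHTML_NS}}}a"):
        href = a.get("href")
        if href is None:
            continue  # a named anchor, not a hyperlink
        # Make the URI absolute and drop the fragment to find the document.
        target_doc, _fragment = urldefrag(urljoin(doc_uri, href))
        if target_doc in model:
            results.append((href, "checked"))
        else:
            results.append((href, "outside model (not checked)"))
    return results
```

Under these assumptions, a link to a sibling page listed in the model is checked, while a link to any other site is silently skipped, which is the restriction to local links described above.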
Other implementations manage the model in other ways and are willing to infer, from the fact that an SML reference points at a document D, that document D is part of the model. In such implementations, the practical issue involved is likely to be the opposite one: SML validation involves checking the model, not just checking individual documents within the model. So as soon as document D is added to the model, the task of checking document D's outgoing links is added to the work needed to validate the model. There is every danger here that a link checker based on the SML model validation would acquire new work faster than it could deal with old work and would never terminate.

These consequences make an SML-based link checker seem unlikely to be useful, but they do not shake the WG's faith in the design choices of having finite models that may contain more than one document, or of making model validation involve checking the entire model.

2) SML assumes by design that the documents in the model are XML documents.

This is important for validation, and it simplifies the design space considerably. (It is also, for what it's worth, specified in the charter of the Working Group, so changing this assumption is unlikely to be easy.)

This design assumption also poses challenges for an XHTML-based reference scheme and for an SML-based link checker. It is unlikely that all the outgoing links in an HTML document will point to XML documents. What is to be done with links to images in formats other than SVG? Or links to CSS stylesheets, and JavaScript script files, and HTML documents?

Such resources can be made checkable in SML, if we are able to assume that our URI resolver performs double duty as a proxy server and is fitted out with a fairly comprehensive set of XML lenses, such that whatever resource we request, we get back an XML representation of that resource.
XHTML and other XML documents are served as is (but see below), HTML documents are filtered through Tidy, and binary image formats are translated into base64 (possibly represented using some sort of MTOM interface), with appropriate metadata in the other elements and attributes of the document.

The use of XML lenses is not currently widespread, but the idea is not new with the SML WG. See, for example, the paper by Tony Lavinio at XML 2007 (URIResolver augmented with XML lenses for EDI and CSV) and, for XML lenses for viewing relational data, virtually every vendor currently shipping a SQL product.

3) By design, SML provides both simple constraints like targetRequired and more expressive constraints (whatever can be expressed in a Schematron rule).

This design choice works well for an SML-based link checker. Users of link checkers frequently wish to know that the resource returned is as expected; they may wish to know, for example, that img/@src points to a resource returned by the server with an image/* MIME type -- or for some sites, that it is specifically one of image/png, image/jpeg, or image/svg+xml. Script links and stylesheet links should similarly be checked for appropriate MIME types. Some users want to ensure that the title of the page retrieved matches a regex constructed from the link text (to detect URIs which still resolve but no longer include the information which led to the link being made in the first place).

This is reasonably easy to accomplish, if the XML lenses used by our URI resolver / proxy include metadata from the HTTP header. Schematron assertions can be used to check compatibility of the resource with the link (although they will have some trouble with the regular-expression requirement, unless they are using a version of Schematron based on XPath 2.0, which is currently unusual).

4) SML assumes that SML references are elements.
The biggest difficulty in making a reference scheme which supports the hyperlinks of XHTML is that unlike SML, XHTML does not assume that each hyperlink is carried by a distinct element.

The invariant that SML references are elements (not attributes, and not sequences of elements) allows a number of useful features and important simplifications. It allows some elements bound to a particular declaration or governed by a particular type to be SML references, and others not (they carry sml:ref="true" if and only if they are SML references). It allows the same reference to contain representations of the reference using multiple reference schemes (e.g. the SML URI reference scheme and the EPR reference scheme). And it makes possible a relatively simple, straightforward definition of reference cycles (needed for the 'acyclic' constraint).

5) SML assumes by design that SML references have single elements as targets.

The rule that each reference links to at most one target element also allows a certain simplification; if references could have multiple targets it would be necessary to add machinery for specifying which outgoing link, from a given reference type, should be subject to which sets of constraints. If each link source is a separate element, much less machinery is needed.

It's fairly straightforward to specify a reference scheme for xhtml:a elements, which seeks the target of the link by resolving the URI in ./@href. But this is not the same as supporting XHTML hyperlinking.
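
As an illustration only, the resolution step of such an xhtml:a scheme might look like the sketch below. It is not the SML URI reference scheme or any text from the spec: the function name resolve_a_href, the dictionary-based model, and the choice to match fragments against @id are assumptions made for the example.

```python
from urllib.parse import urljoin, urldefrag
import xml.etree.ElementTree as ET


def resolve_a_href(model, base_uri, a_element):
    """Return the single target element of an xhtml:a reference, or None.

    `model` is a hypothetical dict mapping absolute document URIs to
    parsed root elements. None means the reference is unresolved.
    """
    href = a_element.get("href")
    if href is None:
        return None
    doc_uri, fragment = urldefrag(urljoin(base_uri, href))
    root = model.get(doc_uri)
    if root is None:
        return None  # target document is not in the model
    if not fragment:
        return root  # no fragment: take the document element as target
    # Assumption: a bare fragment names an element by its id attribute.
    for element in root.iter():
        if element.get("id") == fragment:
            return element
    return None
```

Because the function returns at most one element for each xhtml:a element, it fits the two invariants above: one reference per element, and at most one target per reference.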
We did not attempt an exhaustive survey of XHTML, but we spent an hour or so considering what would be involved in defining a reference scheme to support the xhtml:object element, which carries three outgoing links:

    @classid (identifies an implementation)
    @data (reference to object's data)
    @usemap (use client-side image map)

or the xhtml:img element, which also carries three:

    @src (URI of image to embed)
    @longdesc (link to long description [complements alt])
    @usemap (use client-side image map)

We found no good approaches to supporting these elements. It's possible to define a scheme that pays attention to only one of the outgoing links, of course, but that did not seem to count as solving the assigned problem, since it fails to check two out of three potential outgoing links.

One could define three different schemes, one for each outgoing link, but SML specifies by design that if an SML reference is provided with multiple reference schemes, then each scheme must resolve to the same target element. That's a crucial assumption for allowing reliable use of multiple schemes, so we do not wish to change it.

Real support for xhtml:object or xhtml:img would require allowing SML references to be associated with attributes, not elements. And that, in turn, would make it impossible (or implausibly difficult) to allow some instances of a particular declaration to be SML references, without requiring that all be. See point 1 above.

It is instructive to note that the xhtml:object and xhtml:img elements defeated our efforts to define an XHTML reference scheme for pretty much the same reasons that they have defeated efforts to define XHTML hyperlinking in terms of XLink. XLink makes many of the same assumptions as SML, and thus suffers from the same impedance mismatch as SML in trying to describe XHTML hyperlinking.
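
The mismatch can be seen in a small sketch that simply lists the URI-valued attributes discussed above. The table LINK_ATTRS covers only the attributes named in this message (it is not a complete survey of XHTML), and the function name outgoing_links is invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Only the link-bearing attributes mentioned in this message; the real
# XHTML vocabulary has more.
LINK_ATTRS = {
    "a": ("href",),
    "img": ("src", "longdesc", "usemap"),
    "object": ("classid", "data", "usemap"),
}


def outgoing_links(root):
    """Yield (element name, attribute name, URI) for each outgoing link."""
    prefix = f"{{{XHTML_NS}}}"
    for element in root.iter():
        tag = element.tag
        if not isinstance(tag, str) or not tag.startswith(prefix):
            continue  # skip non-XHTML elements
        local_name = tag[len(prefix):]
        for attr in LINK_ATTRS.get(local_name, ()):
            value = element.get(attr)
            if value is not None:
                yield (local_name, attr, value)
```

A single xhtml:img with both @src and @longdesc yields two entries for the same element, so the unit being checked here is the attribute, not the element. That is exactly what a scheme built on one reference per element cannot express.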
Some members of the Working Group are inclined to feel that it would be useful to work out a more flexible notion of SML reference, which did not assume that each SML reference is an XML element. Eliminating assumption 4 would be useful not only in allowing SML to describe existing vocabularies but in allowing more flexibility in the design of new vocabularies. But even those WG members most enthusiastic for the idea agree that design choice 4 makes possible a simpler design; to allow multiple references to be housed in the same element would require a somewhat more complicated way of identifying and describing references. It seems better to keep SML 1.1 simpler and postpone the idea of a more powerful and complicated design for a later version of the spec.

Summary and conclusions

In sum, defining a useful XHTML link checker in SML terms would require changes to a number of properties of the current SML design.

- The assumption that not every hyperlink is an SML reference, and that it is important to be able to specify which links are, and which are not, on a link by link basis.
- The assumption that SML references point to elements.
- The assumption that SML references point to targets within the model.
- The assumption that SML validation is validation of the model as a whole.
- The assumption that SML validation needs to be a bounded activity, guaranteed to terminate.
- The assumption that no two outgoing SML references start from the same reference element.
- The assumption that any SML reference targets at most one target element.

If we changed some or all of these assumptions, it might be possible to do link checking with SML. This would certainly be an advantage. But we believe the gain would be relatively modest. XHTML link checking has no need of some parts of SML, and indeed it has little need of *most* of SML: at a first approximation, every element or attribute of type URI in the XHTML vocabulary should be checked to ensure that it resolves.
There is little point, for an XHTML link checker, in specifying that some, but not all, hyperlinks are to be validated. (And when there is any point, the choice of which ones to resolve and which not to resolve is unlikely to be static or stable.) There is little or no use in a link checker for the targetType, targetElement, or acyclic constraints.

Weighing the design cost against the gain (whether theoretical or practical), the SML WG feels the cost in complexity and variability would far outweigh the gain.

A reference scheme that does not cover all of XHTML but only the simpler hyperlinking elements in the vocabulary (e.g. xhtml:a) would be possible, but, we think, also somewhat less interesting.

While we understand the logic behind the suggestion that "[if] it's hard, ... that uncovers a weakness in [the] spec", we believe that in fact the difficulties stem not from weaknesses in the spec, but from invariants which focus the application area and simplify both the specification and implementation of SML. In other words: these aren't weaknesses, they are design choices.

Having discussed the degree to which a reference scheme compatible with SML's design could be formulated for XHTML, we turn finally to the question of whether such a scheme should be defined by the SML WG or, if appropriate, by others. For several reasons, we incline against defining such a scheme ourselves.

- Hypertext and XHTML are not the focus of interest for most members of the SML WG; the applications of SML envisaged by the current membership of the WG uniformly involve very different kinds of data. We are not confident that our interests and expertise match up well with the requirements of the task.
- If our analysis of the technical problems is correct, such a scheme is unlikely to be of much practical use, and thus unlikely to be of interest except as an intellectual exercise.
- We believe the interoperability of SML-annotated schemas and SML processors will be best served if the SML user community focuses on a small number of reference schemes, ideally one. We have in part for this reason removed the EPR scheme from the SML spec and moved it to a Working Group Note (not yet published).

So our intention is to close this issue with a disposition of WONTFIX.
Received on Thursday, 8 May 2008 17:22:42 UTC