[Bug 5562] SML should define an XHTML href Reference Scheme from bugzilla@farnsworth.w3.org on 2008-05-08 (public-sml@w3.org from May 2008)

From: <bugzilla@farnsworth.w3.org>
Date: Thu, 08 May 2008 17:22:05 +0000
To: public-sml@w3.org
CC:
Message-Id: <E1Ju9oj-0003YR-92@farnsworth.w3.org>
http://www.w3.org/Bugs/Public/show_bug.cgi?id=5562


cmsmcq@w3.org changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED




------- Comment #3 from cmsmcq@w3.org  2008-05-08 17:22 -------
The SML WG discussed this issue at some length at our face to face
meeting 31 March - 2 April; I have been asked to summarize our
discussions and expected resolution of the issue.

The initial issue description comes from Henry Thompson's comment #5
on bug 5513 (http://www.w3.org/Bugs/Public/show_bug.cgi?id=5513#c5):

    The SML spec. itself should define ... an XHTML href Reference
    Scheme ....  Either it's easy to do this, so you definitely
    should, or it's hard, in which case that uncovers a weakness in
    your spec.

Several questions are intertwined here, which it may be useful to try
to distinguish as far as possible.

  Q1. Is it desirable that SML be applicable to legacy data
      (i.e. to document vocabularies not designed with SML in mind)?
      Is the idea of applying SML to XHTML in itself absurd?

  Q2. Is it in fact possible to specify a reference scheme that
      would work with XHTML?

  Q3. If it is possible, or to the extent that it is possible, is
      it desirable that the SML WG should define such a scheme?

In principle, as Sandy Gao pointed out in
http://lists.w3.org/Archives/Public/public-sml/2008Feb/0271.html, it
is desirable for SML to be applicable to legacy data, with minimal or
no change to the data.  So it seems to me not unreasonable to ask
whether SML could be applied to XHTML, for example as a quick and
simple way to build a link checker.  (It is true that not everyone in
the SML WG agreed with me on that point, but eventually the WG did
agree to consider whether defining such a scheme would be technically
feasible.)

We spent most of an afternoon thinking about what would be entailed in
specifying a reference scheme for XHTML and whether it's possible at
all.  As a way of making the topic more concrete, we asked (following
HT's lead in
http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2008Mar/0002.html)
what it would take to make at least a rudimentary link checker for
XHTML documents, using SML technology.

The short answer is: yes, it's possible, within limits and with
certain ancillary assumptions, to define a reference scheme for XHTML.
But the limits are severe enough, and the ancillary assumptions
problematic enough, that in practice such a link checker is unlikely
to be of wide interest.  Some of the technical points we arrived at
may be instructive, especially the points where our whiteboard design
exhibited shortcomings and limitations.

1) SML assumes by design (a) that not every hyperlink is an SML
reference to be validated, and (b) that SML references point to
elements within the model.

References to documents outside the model will not, in normal
practice, be SML references (and if they are, they will be unresolved
-- by design, SML references resolve to elements *within the model*).
HTML similarly assumes that the target of every reference is part of
the Web.

SML's design assumption has some disappointing implications for the
design of any SML-based link checker for XHTML: either the link
checker is likely to check too few links, or too many, depending on
how the SML implementation chooses to manage its knowledge of what
documents are in the model.

Some implementations manage the model by keeping an explicit list of
the documents in the model; any link going to a resource outside that
list won't be checked.  An SML-based link checker along these lines
seems unlikely to be of much interest unless (for example) the model
contains a substantial part of an organization's web site, and the
SML-based link checker is expected to check only links to other
resources on the same site.  If the W3C's link checker is any guide,
most users of link checkers would find this restriction to local links
off-putting.

Other implementations manage the model in other ways and are willing
to infer, from the fact that an SML reference points at a document D,
that document D is part of the model.  In such implementations, the
practical issue involved is likely to be the opposite one: SML
validation involves checking the model, not just checking individual
documents within the model.  So as soon as document D is added to the
model, the task of checking document D's outgoing links is added to
the work needed to validate the model.  There is every danger here
that a link checker based on the SML model validation would acquire
new work faster than it could deal with old work and would never
terminate.
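
As a rough illustration (not part of the WG's whiteboard design), the
two ways of managing the model can be contrasted in a few lines of
Python.  Everything here is an assumption made for the sketch: the toy
link graph LINKS, the document names, both function names, and the
`budget` cap, which exists only so that the example terminates.

```python
# Hypothetical toy web: document name -> outgoing link targets.
LINKS = {
    "a.html": ["b.html", "http://example.org/x"],
    "b.html": ["a.html"],
    "http://example.org/x": ["http://example.org/y"],
    "http://example.org/y": ["http://example.org/z"],
    "http://example.org/z": [],
}


def check_explicit(model):
    """Explicit-list model: links that leave the list are never checked."""
    checked, skipped = [], []
    for doc in sorted(model):
        for target in LINKS.get(doc, []):
            (checked if target in model else skipped).append((doc, target))
    return checked, skipped


def check_inferred(seeds, budget):
    """Inferred model: every referenced document joins the model.

    Returns (model, finished).  `budget` is an artificial cap on the
    number of documents; on the real Web the queue may never drain.
    """
    model, queue = set(seeds), list(seeds)
    while queue:
        doc = queue.pop(0)
        for target in LINKS.get(doc, []):
            if target not in model:
                if len(model) >= budget:
                    return model, False  # gave up: model still growing
                model.add(target)
                queue.append(target)
    return model, True
```

With the explicit model {a.html, b.html}, the link to
http://example.org/x is simply skipped.  With the inferred model, each
document added brings its own outgoing links with it, and only the
artificial budget stops the growth.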

These consequences make an SML-based link checker seem unlikely to be
useful, but they do not shake the WG's faith in the design choices of
having finite models that may contain more than one document, or of
making model validation involve checking the entire model.


2) SML assumes by design that the documents in the model are XML
documents.

This is important for validation, and it simplifies the design space
considerably.  (It is also, for what it's worth, specified in the
charter of the Working Group, so changing this assumption is unlikely
to be easy.)

This design assumption also poses challenges for an XHTML-based
reference scheme and for an SML-based link checker.  It is unlikely
that all the outgoing links in an HTML document will point to XML
documents.  What is to be done with links to images in formats other
than SVG?  Or links to CSS stylesheets, and JavaScript script files,
and HTML documents?

Such resources can be made checkable in SML, if we are able to assume
that our URI resolver performs double duty as a proxy server and is
fitted out with a fairly comprehensive set of XML lenses, such that
whatever resource we request, we get back an XML representation of
that resource.  XHTML and other XML documents are served as is (but
see below), HTML documents are filtered through Tidy, and binary image
formats are translated into base64 (possibly represented using some
sort of MTOM interface), with appropriate metadata in the other
elements and attributes of the document.
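
A minimal sketch of such a lens, in Python, may make the idea more
concrete.  The element names <resource>, <meta>, and <data> and the
function xml_lens are invented for this example; they are not taken
from SML or from any existing resolver.  The sketch wraps even XML
resources in a <resource> element so that metadata can travel with
them, and it omits the Tidy step for HTML entirely.

```python
import base64
import xml.etree.ElementTree as ET

# Media types that this sketch treats as already being XML.
XML_TYPES = ("application/xhtml+xml", "application/xml", "text/xml")


def xml_lens(uri, content_type, body):
    """Return an XML element representing a fetched resource.

    `body` is the resource as bytes.  The <resource>, <meta>, and
    <data> names are hypothetical.
    """
    root = ET.Element("resource", {"uri": uri})
    ET.SubElement(root, "meta", {"name": "Content-Type", "value": content_type})
    if content_type in XML_TYPES:
        # XML content is parsed and embedded without translation.
        root.append(ET.fromstring(body))
    else:
        # Anything else is carried as base64 text.
        data = ET.SubElement(root, "data", {"encoding": "base64"})
        data.text = base64.b64encode(body).decode("ascii")
    return root
```

Under these assumptions, a request for a PNG image yields a small XML
document whose <data> element holds the encoded bytes, so that the
image can be the target of a reference that resolves to an element.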

The use of XML lenses is not currently widespread, but the idea is not
new with the SML WG.  See, for example, the paper by Tony Lavinio at
XML 2007 (URIResolver augmented with XML lenses for EDI and CSV) and,
for XML lenses that view relational data, virtually every vendor
currently shipping a SQL product.


3) By design, SML provides both simple constraints like targetRequired
and more expressive constraints (whatever can be expressed in a
Schematron rule).

This design choice works well for an SML-based link checker.

Users of link checkers frequently wish to know that the resource
returned is as expected; they may wish to know, for example, that
img/@src points to a resource returned by the server with an image/*
MIME type -- or for some sites, that it is specifically one of
image/png, image/jpeg, or image/svg+xml.  Script links and stylesheet
links should similarly be checked for appropriate MIME types.  Some
users want to ensure that the title of the page retrieved matches a
regex constructed from the link text (to detect URIs which still
resolve but no longer include the information which led to the link
being made in the first place).

This is reasonably easy to accomplish, if the XML lenses used by our
URI resolver / proxy include metadata from the HTTP header.
Schematron assertions can be used to check compatibility of the
resource with the link (although they will have some trouble with the
regular-expression requirement, unless they are using a version of
Schematron based on XPath 2.0, which is currently unusual).
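
The kind of check involved can be sketched in Python rather than
Schematron.  The sketch assumes a hypothetical XML representation of
the resource in which a <meta name="Content-Type"> element carries
the HTTP header value and a <title> child carries the page title;
neither that layout nor the function names come from SML.

```python
import re
import xml.etree.ElementTree as ET

# Image types a hypothetical site policy might accept.
ALLOWED_IMAGE_TYPES = {"image/png", "image/jpeg", "image/svg+xml"}


def content_type(resource):
    """Read the Content-Type recorded in the (assumed) <meta> element."""
    meta = resource.find("meta[@name='Content-Type']")
    return meta.get("value") if meta is not None else None


def image_ok(resource):
    """True if the resource was served with an accepted image type."""
    return content_type(resource) in ALLOWED_IMAGE_TYPES


def title_matches(link_text, resource):
    """True if the page title still contains the link text as a phrase."""
    title = resource.findtext("title", default="")
    pattern = r"\b" + re.escape(link_text.strip()) + r"\b"
    return re.search(pattern, title, flags=re.IGNORECASE) is not None
```

The first check corresponds to the MIME-type requirement on img/@src;
the second is one possible reading of the "title matches a regex
constructed from the link text" requirement, which is the part that
XPath 1.0 cannot express directly.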


4) SML assumes that SML references are elements.

The biggest difficulty in making a reference scheme which supports the
hyperlinks of XHTML is that unlike SML, XHTML does not assume that
each hyperlink is carried by a distinct element.  

The invariant that SML references are elements (not attributes, and
not sequences of elements) allows a number of useful features and
important simplifications.  It allows some elements bound to a
particular declaration or governed by a particular type to be SML
references, and others not (they carry sml:ref="true" if and only if
they are SML references).  It allows the same reference to contain
representations of the reference using multiple reference schemes
(e.g. the SML URI reference scheme and the EPR reference scheme).  And
it makes possible a relatively simple, straightforward definition of
reference cycles (needed for the 'acyclic' constraint).


5) SML assumes by design that SML references have single elements as
targets.

The rule that each reference links to at most one target element also
allows a certain simplification; if references could have multiple
targets it would be necessary to add machinery for specifying which
outgoing link, from a given reference type, should be subject to which
sets of constraints.  If each link source is a separate element, much
less machinery is needed.

It's fairly straightforward to specify a reference scheme for xhtml:a
elements, which seeks the target of the link by resolving the URI in
./@href.  But this is not the same as supporting XHTML hyperlinking.
We did not attempt an exhaustive survey of XHTML, but we spent an hour
or so considering what would be involved in defining a reference
scheme to support the xhtml:object element, which carries three
outgoing links:

    @classid (identifies an implementation)
    @data (reference to object's data)
    @usemap (use client-side image map)

or the xhtml:img element, which also carries three:

    @src (URI of image to embed)
    @longdesc (link to long description [complements alt])
    @usemap (use client-side image map)

We found no good approaches to supporting these elements.  It's
possible to define a scheme that pays attention to only one of the
outgoing links, of course, but that did not seem to count as solving
the assigned problem, since it fails to check two out of three
potential outgoing links.  One could define three different schemes,
one for each outgoing link, but SML specifies by design that if an SML
reference is provided with multiple reference schemes, then each
scheme must resolve to the same target element.  That's a crucial
assumption for allowing reliable use of multiple schemes, so we do not
wish to change it: real support for xhtml:object or xhtml:img would
require allowing SML references to be associated with attributes, not
elements.  And that, in turn, would make it impossible (or implausibly
difficult) to allow some instances of a particular declaration to be
SML references, without requiring that all be.  See point 1 above.
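
The mismatch can be made visible with a short Python sketch that
lists, for each XHTML element, the link-bearing attributes named
above.  The LINK_ATTRS table, the function link_sources, and the
sample document are assumptions for illustration only; the sketch
does not implement an SML reference scheme.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Link-bearing attributes discussed in the text, per element.
LINK_ATTRS = {
    "a": ["href"],
    "object": ["classid", "data", "usemap"],
    "img": ["src", "longdesc", "usemap"],
}


def link_sources(doc):
    """Return [(local element name, {attribute: URI})] in document order."""
    sources = []
    for elem in doc.iter():
        if not (isinstance(elem.tag, str) and elem.tag.startswith(XHTML_NS)):
            continue
        local = elem.tag[len(XHTML_NS):]
        links = {a: elem.get(a) for a in LINK_ATTRS.get(local, [])
                 if elem.get(a) is not None}
        if links:
            sources.append((local, links))
    return sources


SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<a href="other.html#sec1">next</a>
<img src="pic.png" longdesc="pic.html" usemap="#map1" alt="pic"/>
</body></html>"""

doc = ET.fromstring(SAMPLE)
# Elements carrying more than one link cannot be modeled as a single
# reference element with at most one target.
multi = [name for name, links in link_sources(doc) if len(links) > 1]
```

In the sample, the xhtml:a element yields exactly one link and so fits
the one-element, one-target pattern, while the xhtml:img element
yields three and does not.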

It is instructive to note that the xhtml:object and xhtml:img elements
defeated our efforts to define an XHTML reference scheme for pretty
much the same reasons that they have defeated efforts to define XHTML
hyperlinking in terms of XLink.  XLink makes many of the same
assumptions as SML, and thus suffers from the same impedance mismatch
as SML in trying to describe XHTML hyperlinking.

Some members of the Working Group are inclined to feel that it would
be useful to work out a more flexible notion of SML reference, which
did not assume that each SML reference is an XML element.  Eliminating
assumption 4 would be useful not only in allowing SML to describe
existing vocabularies but in allowing more flexibility in the design
of new vocabularies.  But even those WG members most enthusiastic for
the idea agree that design choice 4 makes possible a simpler design;
to allow multiple references to be housed in the same element would
require a somewhat more complicated way of identifying and describing
references.  It seems better to keep SML 1.1 simpler and postpone the
idea of a more powerful and complicated design for a later version of
the spec.



Summary and conclusions

In sum, defining a useful XHTML link checker in SML terms would
require changes to a number of properties of the current SML design.

  - The assumption that not every hyperlink is an SML reference,
    and that it is important to be able to specify which links
    are, and which are not, on a link by link basis.
  - The assumption that SML references point to elements.
  - The assumption that SML references point to targets
    within the model.
  - The assumption that SML validation is validation of the model
    as a whole.
  - The assumption that SML validation needs to be a bounded
    activity, guaranteed to terminate.
  - The assumption that no two outgoing SML references start 
    from the same reference element.
  - The assumption that any SML reference targets at most one
    target element.

If we changed some or all of these assumptions, it might be possible
to do link checking with SML.  This would certainly be an advantage.
But we believe the gain would be relatively modest.  XHTML link
checking has no need of some parts of SML, and indeed it has little
need of *most* of SML: at a first approximation, every element or
attribute of type URI in the XHTML vocabulary should be checked to
ensure that it resolves.  There is little point, for an XHTML link
checker, in specifying that some, but not all, hyperlinks are to be
validated.  (And when there is any point, the choice of which ones to
resolve and which not to resolve is unlikely to be static or stable.)
There is little or no use in a link checker for the targetType,
targetElement, or acyclic constraints.  Weighing the design cost
against the gain (whether theoretical or practical), the SML WG feels
the cost in complexity and variability would far outweigh the gain.

A reference scheme that does not cover all of XHTML but only the
simpler hyperlinking elements in the vocabulary (e.g. xhtml:a) would
be possible, but, we think, also somewhat less interesting.

While we understand the logic behind the suggestion that "[if] it's
hard, ... that uncovers a weakness in [the] spec", we believe that in
fact the difficulties stem not from weaknesses in the spec, but from
invariants which focus the application area and simplify both the
specification and implementation of SML.  In other words: these aren't
weaknesses, they are design choices.

Having discussed the degree to which a reference scheme compatible
with SML's design could be formulated for XHTML, we turn finally to
the question of whether such a scheme should be defined by the SML WG
or, if appropriate, by others.  For several reasons, we incline against
defining such a scheme ourselves.

  - Hypertext and XHTML are not the focus of interest for most
    members of the SML WG; the applications of SML envisaged by the
    current membership of the WG uniformly involve very different
    kinds of data. We are not confident that our interests and
    expertise match up well with the requirements of the task.

  - If our analysis of the technical problems is correct, such a
    scheme is unlikely to be of much practical use, and thus unlikely
    to be of interest except as an intellectual exercise.

  - We believe the interoperability of SML-annotated schemas and SML
    processors will be best served if the SML user community focuses on
    a small number of reference schemes, ideally one.  We have in part
    for this reason removed the EPR scheme from the SML spec and moved
    it to a Working Group Note (not yet published).

So our intention is to close this issue with a disposition of WONTFIX.
Received on Thursday, 8 May 2008 17:22:42 UTC