draft response on bug 5562 from C. M. Sperberg-McQueen on 2008-04-22 (public-sml@w3.org from April 2008)

From: C. M. Sperberg-McQueen <cmsmcq@acm.org>
Date: Tue, 22 Apr 2008 17:39:00 -0600
To: public-sml@w3.org
Cc: "C. M. Sperberg-McQueen" <cmsmcq@acm.org>
Message-Id: <C0E9F0A5-B55A-4DD8-B48C-DB17EF563C9B@acm.org>
In re: bug 5562 SML should define an XHTML href Reference Scheme
http://www.w3.org/Bugs/Public/show_bug.cgi?id=5562

The following text is a draft response on bug 5562, for consideration
by the WG.  It discharges action 178 against me
(http://www.w3.org/2005/06/tracker/sml/actions/178).  Warning: very
long. Do not attempt to read in the five minutes before the WG call.

--MSM


The SML WG discussed this issue at some length at our face to face
meeting 31 March - 2 April; I have been asked to summarize our
discussions and expected resolution of the issue.

The initial issue description comes from Henry Thompson's comment #5
on bug 5513 (http://www.w3.org/Bugs/Public/show_bug.cgi?id=5513#c5):

     The SML spec. itself should define ... an XHTML href Reference
     Scheme ....  Either it's easy to do this, so you definitely
     should, or it's hard, in which case that uncovers a weakness in
     your spec.

Several questions are intertwined here, which it may be useful to try
to distinguish as far as possible.

   Q1. Is it desirable that SML be applicable to legacy data
       (i.e. to document vocabularies not designed with SML in mind)?
       Is the idea of applying SML to XHTML in itself absurd?

   Q2. Is it in fact possible to specify a reference scheme that
       would work with XHTML?

   Q3. If it is possible, or to the extent that it is possible, is
       it desirable that the SML WG should define such a scheme?

In principle, as Sandy Gao pointed out in
http://lists.w3.org/Archives/Public/public-sml/2008Feb/0271.html, it
is desirable for SML to be applicable to legacy data, with minimal or
no change to the data.  So it seems to me not unreasonable to ask
whether SML could be applied to XHTML, for example as a quick and
simple way to build a link checker.  (It is true that not everyone in
the SML WG agreed with me on that point, but eventually the WG did
agree to consider whether defining such a scheme would be technically
feasible.)

We spent most of an afternoon thinking about what would be entailed in
specifying a reference scheme for XHTML and whether it's possible at
all.  As a way of making the topic more concrete, we asked (following
HT's lead in
http://lists.w3.org/Archives/Member/w3c-xml-schema-ig/2008Mar/0002.html)
what it would take to make at least a rudimentary link checker for
XHTML documents, using SML technology.

The short answer is: yes, it's possible, within limits and with
certain ancillary assumptions, to define a reference scheme for XHTML.
But the limits are severe enough, and the ancillary assumptions
problematic enough, that in practice such a link checker is unlikely
to be of wide interest.  Some of the technical points we arrived at
may be instructive, especially the points of tension in our whiteboard
design.

1) SML assumes by design (a) that not every hyperlink is an SML
reference to be validated, and (b) that SML references point to
elements within the model.

References to documents outside the model will not, in normal
practice, be SML references (and if they are, they will be unresolved
-- by design, SML references resolve to elements *within the model*).
HTML similarly assumes that the target of every reference is part of
the Web.

In implementations which manage the model by keeping an explicit list
of the documents in the model, an SML-based link checker is unlikely
to be of much interest unless (for example) the model contains a
substantial part of an organization's web site, and the SML-based link
checker is expected to check only links to other resources on the same
site.  If the W3C's link checker is any guide, most users of link
checkers would find this restriction to local links off-putting.

In implementations which manage the model in other ways, and are
willing to infer, from the fact that an SML reference points at a
document D, that document D is part of the model, the practical issue
involved is likely to be the opposite one.  SML validation involves
checking the model, not just checking individual documents within the
model.  So as soon as document D is added to the model, the task of
checking document D's outgoing links is added to the work needed to
validate the model.
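
The effect can be sketched in a few lines of Python.  This is an
illustration only, not behavior defined by SML: the LINKS table, the
document names, and the function model_closure are hypothetical, and a
real implementation would dereference URIs instead of consulting a
dictionary.

```python
from collections import deque

# Hypothetical link graph: document -> documents its SML references point at.
LINKS = {
    "a.xml": ["b.xml", "c.xml"],
    "b.xml": ["d.xml"],
    "c.xml": ["a.xml"],
    "d.xml": ["e.xml"],
    "e.xml": [],
}

def model_closure(seed, links):
    """Return every document pulled into the model when membership is
    inferred from the fact that some reference points at the document."""
    model = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        doc = queue.popleft()
        model.append(doc)
        # Each newly added document contributes its own outgoing links.
        for target in links.get(doc, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return model

closure = model_closure("a.xml", LINKS)
print(closure)
```

Starting from a single document, the model grows to everything
reachable from it.  On the open Web that set has no practical bound,
which is the difficulty described above.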


2) SML assumes by design that the documents in the model are XML
documents.

This is important for validation, and it simplifies the design space
considerably.  (It is also, for what it's worth, specified in the
charter of the Working Group, so changing this assumption is unlikely
to be easy.)

It is unlikely that all the outgoing links in an HTML document will
point to XML documents.  What is to be done with links to images in
formats other than SVG?  Or links to CSS stylesheets, and JavaScript
script files, and HTML documents?

Such resources can be made checkable in SML, if we are able to assume
that our URI resolver performs double duty as a proxy server and is
fitted out with a fairly comprehensive set of XML lenses, such that
whatever resource we request, we get back an XML representation of
that resource.  XHTML and other XML documents are served as is (but
see below), HTML documents are filtered through Tidy, and binary image
formats are translated into base64 (possibly represented using some
sort of MTOM interface), with appropriate metadata in the other
elements and attributes of the document.
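
A minimal sketch of such a lens, in Python, may make the idea more
concrete.  Everything here is an assumption made for illustration:
the function name to_xml_view, the wrapper element names "resource"
and "data", and the test for XML media types are all hypothetical.
The Tidy step for HTML and any MTOM packaging are omitted.

```python
import base64
import xml.etree.ElementTree as ET

def to_xml_view(content_type, body):
    """Hypothetical lens: wrap a retrieved resource in an XML element.
    XML content is passed through as a child element; anything else is
    carried as base64 text, with the media type kept as metadata."""
    root = ET.Element("resource", {"content-type": content_type})
    if content_type.endswith("xml"):
        # Covers text/xml, application/xhtml+xml, image/svg+xml, etc.
        root.append(ET.fromstring(body))
    else:
        data = ET.SubElement(root, "data", {"encoding": "base64"})
        data.text = base64.b64encode(body).decode("ascii")
    return root

png_view = to_xml_view("image/png", b"\x89PNG")
xhtml_view = to_xml_view(
    "application/xhtml+xml",
    b"<html xmlns='http://www.w3.org/1999/xhtml'/>",
)
print(ET.tostring(png_view, encoding="unicode"))
```

With a resolver of this kind, every link target becomes an XML
document with a root element, which is what SML validation expects to
find at the end of a reference.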

The use of XML lenses is not currently widespread, but the idea is not
new with the SML WG.  See, for example, the paper by Tony Lavinio at
XML 2007 (a URIResolver augmented with XML lenses for EDI and CSV)
and, for XML lenses used to view relational data, virtually every
vendor currently shipping a SQL product.


3) Users of link checkers frequently wish to know that the resource
returned is as expected; they may wish to know, for example, that
img/@src points to a resource returned by the server with an image/*
MIME type -- or for some sites, that it is specifically one of
image/png, image/jpeg, or image/svg+xml.  Script links and stylesheet
links should similarly be checked for appropriate MIME types.  Some
users want to ensure that the title of the page retrieved matches a
regex constructed from the link text (to detect URIs which still
resolve but no longer include the information which led to the link
being made in the first place).

This is reasonably easy to accomplish, if the XML lenses used by our
URI resolver / proxy include metadata from the HTTP header.
Schematron assertions can be used to check compatibility of the
resource with the link (although they will have some trouble with the
regular-expression requirement).
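
The two checks can be sketched outside Schematron as follows.  This
is a Python illustration under stated assumptions, not a rendering of
any Schematron rule: the function names are hypothetical, the
Content-Type value is assumed to have been exposed by the lens as
metadata, and the way the regex is built from the link text (its words,
in order, with anything between them) is only one of many possible
constructions.  The registered media type for SVG is image/svg+xml.

```python
import re

ALLOWED_IMAGE_TYPES = ("image/png", "image/jpeg", "image/svg+xml")

def check_image_link(content_type, allowed=ALLOWED_IMAGE_TYPES):
    """True if the Content-Type reported for an img/@src target is allowed."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in allowed

def title_matches_link_text(link_text, target_title):
    """True if the words of the link text occur, in order, in the title."""
    words = re.findall(r"\w+", link_text)
    pattern = ".*".join(re.escape(word) for word in words)
    return re.search(pattern, target_title, re.IGNORECASE) is not None

print(check_image_link("image/png"))
print(title_matches_link_text("SML spec", "The SML 1.1 Spec"))
```

The second function is the part that is awkward to express in
Schematron, since the pattern must be computed from the content of the
linking document.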


4) SML assumes that SML references are elements.

The biggest difficulty in making a reference scheme which supports the
hyperlinks of XHTML is that unlike SML, XHTML does not assume that
each hyperlink is carried by a distinct element.

The invariant that SML references are elements (not attributes, and
not sequences of elements) allows a number of useful features and
important simplifications.  It allows some elements bound to a
particular declaration or governed by a particular type to be SML
references, and others not (they carry sml:ref="true" if and only if
they are SML references).  It allows the same reference to contain
representations of the reference using multiple reference schemes
(e.g. the SML URI reference scheme and the EPR reference scheme).  And
it makes possible a relatively simple, straightforward definition of
reference cycles (needed for the 'acyclic' constraint).

The rule that each reference links to at most one target element also
allows a certain simplification; if references could have multiple
targets it would be necessary to add machinery for specifying which
outgoing link, from a given reference type, should be subject to which
sets of constraints.  If each link source is a separate element, much
less machinery is needed.

It's fairly straightforward to specify a reference scheme for xhtml:a
elements, which seeks the target of the link by resolving the URI in
./@href.  But this is not the same as supporting XHTML hyperlinking.
We did not attempt an exhaustive survey of XHTML, but we spent an hour
or so considering what would be involved in defining a reference
scheme to support the xhtml:object element, which carries three
outgoing links:

     @classid (identifies an implementation)
     @data (reference to object's data)
     @usemap (use client-side image map)

or the xhtml:img element, which also carries three:

     @src (URI of image to embed)
     @longdesc (link to long description [complements alt])
     @usemap (use client-side image map)

We found no good approaches to supporting these elements.  It's
possible to define a scheme that pays attention to only one of the
outgoing links, of course, but that did not seem to count as solving
the assigned problem, since it fails to check two out of three
potential outgoing links.  One could define three different schemes,
one for each outgoing link, but SML specifies by design that if an SML
reference is provided with multiple reference schemes, then each
scheme must resolve to the same target element.  That's a crucial
assumption for allowing reliable use of multiple schemes, so we do not
wish to change it: real support for xhtml:object or xhtml:img would
require allowing SML references to be associated with attributes, not
elements.  And that, in turn, would make it impossible (or implausibly
difficult) to allow some instances of a particular declaration to be
SML references, without requiring that all be.  See point 1 above.
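
The mismatch can be shown with a small Python sketch.  It is not the
whiteboard design and not an SML reference scheme as the spec would
define one: the names LINK_ATTRS, outgoing_links, and
resolve_single_target are hypothetical, the base URI is an example,
and "resolving" here stops at computing an absolute URI and does not
look up a target element in a model.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

XHTML = "{http://www.w3.org/1999/xhtml}"

# Link-carrying attributes for the three elements discussed above.
LINK_ATTRS = {
    XHTML + "a": ["href"],
    XHTML + "object": ["classid", "data", "usemap"],
    XHTML + "img": ["src", "longdesc", "usemap"],
}

def outgoing_links(element, base_uri):
    """Return the absolute URIs carried by one element's link attributes."""
    attrs = LINK_ATTRS.get(element.tag, [])
    return [urljoin(base_uri, element.get(a)) for a in attrs if element.get(a)]

def resolve_single_target(element, base_uri):
    """Treat the element as a reference with at most one target.
    Raises ValueError when the element carries more than one link."""
    links = outgoing_links(element, base_uri)
    if len(links) > 1:
        raise ValueError("element carries %d links, not one" % len(links))
    return links[0] if links else None

doc = ET.fromstring(
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    '<a href="spec.html#refs">refs</a>'
    '<img src="fig.png" longdesc="fig.html" usemap="#m"/>'
    "</div>"
)
anchor, image = doc[0], doc[1]
print(resolve_single_target(anchor, "http://example.org/docs/"))
print(len(outgoing_links(image, "http://example.org/docs/")))
```

The xhtml:a element fits the one-element, one-target rule; the
xhtml:img element in the example does not, and resolve_single_target
can only reject it or silently ignore two of its three links.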

It is instructive to note that the xhtml:object and xhtml:img elements
defeated our efforts to define an XHTML reference scheme for pretty
much the same reasons that they have defeated efforts to define XHTML
hyperlinking in terms of XLink.  XLink makes many of the same
assumptions as SML, and thus suffers from the same impedance mismatch
as SML in trying to describe XHTML hyperlinking.

In sum, defining a useful XHTML link checker in SML terms would
require changes to a number of properties of the current SML design.

   - The assumption that not every hyperlink is an SML reference,
     and that it is important to be able to specify which links
     are, and which are not, on a link by link basis.
   - The assumption that SML references point to elements.
   - The assumption that SML references point to targets
     within the model.
   - The assumption that SML validation is validation of the model
     as a whole.
   - The assumption that SML validation needs to be a bounded
     activity, guaranteed to terminate.
   - The assumption that no two outgoing SML references start
     from the same reference element.

If we changed some or all of these assumptions, it might be possible
to do link checking with SML.  This would certainly be an advantage.
But we believe the gain would be relatively modest.  XHTML link
checking has no need of some parts of SML, and indeed it has little
need of *most* of SML: at a first approximation, every element or
attribute of type URI in the XHTML vocabulary should be checked to
ensure that it resolves.  There is no point, for an XHTML link
checker, in specifying that some, but not all, hyperlinks are to be
validated.  There is no use for the targetType, targetElement, or
acyclic constraints.  Weighing the design cost against the gain
(whether theoretical or practical), the SML WG feels the cost in
complexity and variability would far outweigh the gain.

A reference scheme that does not cover all of XHTML but only the
simpler hyperlinking elements in the vocabulary (e.g. xhtml:a) would
be possible, but, we think, somewhat less interesting.

While we understand the logic behind the suggestion that "[if] it's
hard, ... that uncovers a weakness in [the] spec", we believe that in
fact the difficulties stem not from weaknesses in the spec, but from
invariants which focus the application area and simplify both the
specification and implementation of SML.  In other words: these aren't
weaknesses, they are design choices.

Having discussed the degree to which a reference scheme compatible
with SML's design could be formulated for XHTML, we turn finally to
the question of whether such a scheme should be defined by the SML WG
or, if appropriate, by others.  For several reasons, we incline against
defining such a scheme ourselves.

   - Hypertext and XHTML are not the focus of interest for most
     members of the SML WG.  The applications of SML envisaged by the
     current membership of the WG uniformly involve very different
     kinds of data. We are not confident that our interests and
     expertise match up well with the requirements of the task.

   - If our analysis of the technical problems is correct, such a
     scheme is unlikely to be of much practical use, and thus unlikely
     to be of interest except as an intellectual exercise.

   - We believe the interoperability of SML-annotated schemas and SML
     processors will be best served if the SML user community focuses on
     a small number of reference schemes, ideally one.  We have in part
     for this reason removed the EPR scheme from the SML spec and moved
     it to a Working Group Note (not yet published).

So our intention is to close this issue with a disposition of WONTFIX.
We hope that the explanation above makes clear that we did make a
good-faith effort to resolve the issue, and persuades you that in fact
WONTFIX is the correct resolution.
Received on Tuesday, 22 April 2008 23:39:35 UTC