Re: Using XPointer with HTML from Sean B. Palmer on 2002-04-10 (www-annotation@w3.org from January to June 2002)

From: Sean B. Palmer <sean@mysterylights.com>
Date: Wed, 10 Apr 2002 13:21:15 +0100
To: "Jim Ley" <jim@jibbering.com>, "Steven Pemberton" <steven.pemberton@cwi.nl>
Cc: <www-annotation@w3.org>, "HTML WG" <w3c-html-wg@w3.org>
Message-ID: <009101c1e08a$3825e200$0c560150@localhost>
> > If the document has id's then you can just use a URI,
> > problem solved. But I thought a more general solution
> > was being sought.

Please be careful with terminology here: a URI plus fragment ID is a
URI-reference, according to the BNF and terminology used in RFC 2396
[1].
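
To make the distinction concrete, here is a minimal sketch using Python's
standard urllib.parse module (my choice of tool, not something anyone in
this thread has proposed). It splits the example URI-reference into the
URI proper and the fragment identifier:

```python
from urllib.parse import urldefrag

# The full string is a URI-reference in RFC 2396 terms.
uri_reference = "http://jibbering.example/example#xpointer(id('Moomin'))"

# urldefrag splits it into the URI proper and the fragment identifier.
uri, fragment = urldefrag(uri_reference)

print(uri)
print(fragment)
```

Only the first part is a URI in the strict sense; the part after the "#"
is interpreted against whatever the retrieval returns.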

> Can you have an xhtml1.1 document that is served as
> text/html ?

Of course. A better question is whether XHTML 1.1 served as either
text/html, text/xml, or application/xhtml+xml makes any sense; but
that's out of scope for this discussion, I feel.

> [...] by your argument
> http://jibbering.example/example#xpointer(id('Moomin'))
> points to a different fragment depending on whether the
> resource has an xhtml or html mime-type.

Perhaps. If it's sent as text/xml (or any of the MIME types covered by
XPointer) and there's a Moomin ID declared, then it points to
that fragment of content. If, on the other hand, it's sent as a MIME
type not covered by XPointer, then it's basically going to point to
something "undefined".

> > > The question is not how does XPointer into HTML
> > > compare to XPointer into XHTML, but can we point
> > > to something in an HTML document?

[[[
   The URI specification [URI] notes that the semantics of a fragment
   identifier (part of a URI after a "#") is a property of the data
   resulting from a retrieval action, and that the format and
   interpretation of fragment identifiers is dependent on the media type
   of the retrieval result.

   For documents labeled as text/html, the fragment identifier
   designates the correspondingly named element; any element may be
   named with the "id" attribute, and A, APPLET, FRAME, IFRAME, IMG and
   MAP elements may be named with a "name" attribute.  This is described
   in detail in [HTML40] section 12.
]]] - http://www.ietf.org/rfc/rfc2854.txt

If a document is served as text/html and the fragment that you wish to
identify is not named in the aforementioned manner, then you can't
point to it; it's as simple as that.

But the argument isn't as clear cut as that, as we all know. text/html
and text/xml representations can be served together as variants of a
single resource. In this case, the URI-ref:-

   http://jibbering.example/example#xpointer(id('Moomin'))

is utterly broken. The semantics of the URI-ref seem to depend greatly
upon a retrieval action, and in this case, the semantics of the
XPointer change depending upon one's Accept headers. Accepting
text/html means that the pointer is undefined, and accepting text/xml
means that the pointer is defined.
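
The effect can be sketched as follows. This is a deliberately
simplified model: the negotiation ignores q-values and wildcards, the
function names are hypothetical, and the set of XML media types is my
assumption about which types XPointer covers.

```python
# Assumption: the media types whose fragments XPointer defines.
XPOINTER_TYPES = {"text/xml", "application/xml"}


def negotiate(accept, available):
    """Pick the first listed media type the server can supply.

    Simplified: q-values and wildcards in the Accept header are ignored.
    """
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in available:
            return media_type
    return None


def xpointer_status(accept, available):
    """Say whether an xpointer() fragment means anything for this request."""
    media_type = negotiate(accept, available)
    if media_type is None:
        return (None, "no acceptable variant")
    if media_type in XPOINTER_TYPES:
        return (media_type, "defined")
    return (media_type, "undefined")


# One resource, two variants, one URI-ref, two different answers.
variants = {"text/html", "text/xml"}
print(xpointer_status("text/html", variants))
print(xpointer_status("text/xml", variants))
```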

This is a problem with XPointer - it doesn't work very well with
conventional HTTP machinery that has been around for years. Or rather,
it only works when you necessarily limit yourself to sending a single
(and perhaps fixed) variant. Many people believe that fragments must
be persistent; in the case of XPointer, that means that your XML
document had better not change one iota. So, if you want to use
XPointer, you have to do so on a resource that has a single fixed XML
representation. That's absurd.

So, now you want to create a similar scheme for HTML. Now, since I've
been working with you on the EARL project, I certainly understand what
the rationale behind this is :-) We need to be able to refer to pieces
of HTML documents in order to say things about them - perhaps in the
context of an EARL evaluation, or perhaps outside of it. Using some
kind of pointer mechanism would be fine if (as noted above) people
were constrained to sending their HTML document without any variants,
and without changing the document in such a way that would break the
HTML pointer.

And really I'm just clarifying for the sake of people external to this
discussion who may lack the context. Jim, I remember you saying to me
that the requirement for an HTML pointer is simply to be able to
point into a *single* representation of HTML, and that the requirement
for XPointer is to be able to point into a single representation of
XML. But many people don't seem to get it, and that's a bit of a
shame, so it needs to be underlined at every opportunity.

So the answer to your question is that, yes, the generic idea of an
HTML pointer is good, useful, and architecturally sound, but that one
shouldn't abuse it in the manner that XPointer have abused theirs. In
fact, you can't abuse HTML pointer since there is no way that you can
make an amendment to the HTML media type RFC. Win-win.

As for the canonicalization issues, that's a tough one. As Steven has
pointed out, some HTML documents are just so broken that even HTML Tidy
throws up all over them. OTOH, some documents are valid HTML, and for
those a regular HTML pointer is certainly possible, given work.

So, I suggest that you might want to look into the following
algorithm:-

1) If the representation validates according to one of the standard
HTML DTDs, then use a standard canonicalization, and an HTML pointer.
2) If the representation doesn't validate, then you have two choices:-
2a) Attempt to canonicalize it anyway, and use an HTML pointer
2b) Point to the piece of information using a column and line number,
a regular expression, or something else based on the reduced hash
experiments that Nick started working on.
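
The steps above could be sketched roughly like this. The validator and
canonicalizer are passed in as callables because no particular tool is
settled on here; both are hypothetical placeholders. The sketch also
assumes that 2b is used as a fallback when 2a fails, which is only one
way of reading the two choices.

```python
from typing import Callable, Optional

HTML_POINTER = "canonicalize, then HTML pointer"
BEST_EFFORT = "best-effort canonicalize, then HTML pointer"
RAW_POINTER = "line/column, regular expression, or hash"


def choose_strategy(
    html: str,
    validates: Callable[[str], bool],
    try_canonicalize: Callable[[str], Optional[str]],
) -> str:
    """Pick a pointing strategy for one HTML representation."""
    # Step 1: valid against a standard HTML DTD.
    if validates(html):
        return HTML_POINTER
    # Step 2a: invalid, but a canonical form can still be produced.
    if try_canonicalize(html) is not None:
        return BEST_EFFORT
    # Step 2b: too broken to canonicalize; point into the raw text.
    return RAW_POINTER


# Stand-in callables, for illustration only.
print(choose_strategy("<p>ok</p>", lambda h: True, lambda h: h))
print(choose_strategy("<p>bad", lambda h: False, lambda h: None))
```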

I hope that helps.

[1] http://www.ietf.org/rfc/rfc2396.txt

--
Kindest Regards,
Sean B. Palmer
@prefix : <http://purl.org/net/swn#> .
:Sean :homepage <http://purl.org/net/sbp/> .
Received on Wednesday, 10 April 2002 08:22:14 UTC