Re: FragIds in semantic web (ACTION-543)

Jonathan,

On 22 Apr 2011, at 13:57, Jonathan Rees wrote:
> On Wed, Apr 6, 2011 at 6:18 PM, Jeni Tennison
> <jeni.tennison@googlemail.com> wrote:
>>  Content negotiation becomes extremely difficult when the interpretation
>>  of fragment identifiers depends on the MIME type as there is no
>>  guarantee that the syntax of a fragment identifier that is legal for
>>  one MIME type is also legal (or interpreted in an equivalent way) for
>>  another MIME type. For example, the common `#identifier` syntax for
>>  HTML is not consistent with the XPointer-based syntax defined for XML.
> 
> I don't understand this. Can you explain or give an example? It seems
> to me that XPointer and HTML were designed to be compatible - in fact
> they overlap in the application/xhtml+xml media type.

You're right, I misread the XPointer spec. I know I also lack background in the discussions that you've already had about this and might well have the wrong end of the stick about the issue.

Would a better example be to illustrate using images? If there's a request that addresses an area within an image:

  http://www.example.org/picture#xywh=160,120,320,24

then this makes sense when the response is image/jpeg (as described in 'Media Fragments URI 1.0' [1]). But if conneg returned image/svg+xml instead, that fragment syntax isn't appropriate: the media type registration for image/svg+xml [2] says that fragment identifiers use either the XPointer syntax inherited from RFC 3023 (or its successor) or the SVG Views syntax [3]. The equivalent reference would therefore be:

  http://www.example.org/picture#svgView(viewBox(160,120,320,24))

The point being that the first fragment can't be interpreted if the resulting media type is image/svg+xml and the second can't be interpreted if the resulting media type is image/jpeg.
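
To make that mismatch concrete, here is a minimal Python sketch in which the media type of the response selects the parser for the fragment. The function name interpret_fragment, the PARSERS table and the two regular expressions are hypothetical simplifications for illustration; both specs allow more forms than the ones matched here.

```python
import re

# Simplified patterns, assumed for illustration only; both specs allow more
# forms than these (other units, further svgView parameters, and so on).
XYWH = re.compile(r"^xywh=(\d+),(\d+),(\d+),(\d+)$")
SVG_VIEW = re.compile(r"^svgView\(viewBox\((\d+),(\d+),(\d+),(\d+)\)\)$")

# Hypothetical table: which fragment syntax each media type understands.
PARSERS = {
    "image/jpeg": XYWH,
    "image/svg+xml": SVG_VIEW,
}


def interpret_fragment(media_type, fragment):
    """Return (x, y, w, h) if the fragment parses for media_type, else None."""
    pattern = PARSERS.get(media_type)
    if pattern is None:
        return None
    match = pattern.match(fragment)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())
```

In this sketch each of the two fragments above yields a region under one media type and None under the other, which is the situation a client is in after conneg has picked the representation for it.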

> Are you referring to RFC 3023, or to 3023bis?

RFC 3023. But the above example is better and doesn't need either.

> The problem of inconsistency between simultaneous representations has
> been raised before and should be raised here - it's not syntactic,
> it's semantic. If I have French and Spanish HTML files, both with #foo
> fragids, the fragid "identifies" different elements in the two
> documents - since the documents are different. Yet there is no problem
> in practice as long as the elements serve the same function in
> interaction (they "say the same thing" in the two different
> languages).

So the issue here is about whether fragment identifiers are identifying elements or something more abstract (eg a paragraph)? I guess that's determined by the media type definition?

> I think the AWWW story about consistency between fragment identifier
> meaning among representations is probably worth repeating here, or
> referring to, not because it is a complete solution to the problem but
> because it's the closest thing we have so far to an interpretation of
> 3986 that makes sense.

Agreed. What I don't understand about AWWW is where it says "representation providers must manage content negotiation carefully when used with a URI that contains a fragment identifier" given that a representation provider cannot know whether a URI is going to be used with a fragment identifier and what those fragment identifiers are going to be.

Indeed, RFC 3986 seems to describe the fact that the representation provider cannot control what fragment identifiers are used with the representations that they provide as a benefit:

  "Although
   this separate handling is often perceived to be a loss of
   information, particularly for accurate redirection of references as
   resources move over time, it also serves to prevent information
   providers from denying reference authors the right to refer to
   information within a resource selectively.  Indirect referencing also
   provides additional flexibility and extensibility to systems that use
   URIs, as new media types are easier to define and deploy than new
   schemes of identification."

It seems like RFC 3986 is saying that new fragment identifier schemes can be invented for media types at any point (and that this is a good thing). Given that, how can a representation provider hope to know whether serving two different representations through conneg is or isn't going to cause a problem in the future, especially when some fragment identifiers might be OK (eg bare names) and others not (eg XPointer syntax, which works for application/xhtml+xml but not text/html)?

In other words, surely it's the person making the link who needs to be careful about using a fragment identifier that works across all conneg'd representations, not the representation provider?

>>  This is exacerbated in common semantic web practice, which not only
>>  makes heavy use of content negotiation but in which URLs with fragment
>>  identifiers are used to identify real-world Things.
> 
> Ouch! Since when is the Web not part of the real world? Just read the
> newspapers...
> 
> I think you should just say "things that are not document fragments".

I mean 'things that are non-information resources' but that phrase makes people's eyes roll ;)

> I am not convinced that conneg gets heavy use in semantic web contexts
> - I once spent an hour or so trying to find a single example of RDF
> conneg, and the only one I found was FOAF. Just for my own
> edification, could you elaborate on this?

When linked data is published well, you always provide at least a (human-readable) HTML version and a (machine-readable) RDF (of some sort) version of a resource, and probably more than one RDF format because there are so many to choose from -- you don't just want to have RDF/XML 'cos it sucks and you don't just want to have Turtle because it isn't as well supported as RDF/XML.

There are different ways of doing it, described in 'Cool URIs for the Semantic Web' [4]. DBpedia is an example of one of the patterns: if you request http://dbpedia.org/resource/Berlin you get 303'ed to a different resource depending on your Accept header:

  Accept: application/rdf+xml => http://dbpedia.org/data/Berlin.xml
  Accept: text/n3             => http://dbpedia.org/data/Berlin.n3
  Accept: text/html           => http://dbpedia.org/page/Berlin
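
A minimal Python sketch of that 303 pattern, using only the three mappings listed above. The function name choose_redirect and the fallback to text/html are assumptions made for this example; it takes the first listed media type it recognises and ignores q-values and wildcards, which a real server would have to handle.

```python
# Mapping copied from the three Accept header examples above.
REDIRECTS = {
    "application/rdf+xml": "http://dbpedia.org/data/Berlin.xml",
    "text/n3": "http://dbpedia.org/data/Berlin.n3",
    "text/html": "http://dbpedia.org/page/Berlin",
}


def choose_redirect(accept_header, default="text/html"):
    """Return (status, location) for a request on the non-information resource.

    Simplification: the first recognised media type in the header wins;
    q-values and wildcards such as */* are not interpreted.
    """
    for item in accept_header.split(","):
        # Drop any parameters such as ";q=0.9" and normalise case.
        media_type = item.split(";")[0].strip().lower()
        if media_type in REDIRECTS:
            return 303, REDIRECTS[media_type]
    # Assumed fallback for unrecognised or wildcard Accept headers.
    return 303, REDIRECTS[default]
```

Note that the fragment identifier never reaches this code: it is not sent to the server, so the choice of representation is made without knowing which fragment the client intends to use.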

In the linked data pages we serve from data.gov.uk, we conneg off the document rather than the non-information resource; if you try some different Accept headers with http://reference.data.gov.uk/doc/department/co you'll see what I mean.

Another example is the BBC Wildlife finder. You can try http://www.bbc.co.uk/nature/life/Pygmy_Three-toed_Sloth with Accept: application/rdf+xml for example, and you'll get RDF/XML rather than the usual HTML.

[snip]
> Here's something interesting that I've just thought of - you might
> think that a fragid can at least be resolved locally, i.e. if the
> reference occurs in representation A then its definition can be
> expected to be found in representation A. But it seems likely that in
> RDF-conneg situations you might have a reference in A that would have
> to be resolved in representation B, e.g. the RDF version linking to
> the HTML version or vice versa. Does this happen much in practice? We
> see occurrences of FOAF RDF fragids in the FOAF HTML file, for
> example, and maybe vice versa.

There is a school of thought in linked data circles that says that when hash URIs are used for non-information resources (as in linked data), they are explicitly not addressing fragments within any content (not an element, not a paragraph), and therefore *no* representation should contain a fragment that is so identified.

RFC 3986 seems to allow this as it says (my emphasis):

   Each representation should either define the
   fragment so that it corresponds to the same secondary resource,
   regardless of how it is represented, *or should leave the fragment
   undefined (i.e., not found)*.

> For me the biggest problem with conneg+fragid is that it destroys the
> follow your nose story. Given a particular fragid there is no way to
> know ahead of time which representation it's defined in and therefore
> what to try to conneg for. If you get it wrong, there's no way to
> iterate through all the representations looking for the fragid
> definition, except in the unlikely event the server does TCN.

I think what we're seeing is that there are three distinct types of fragment identifiers now:

  1. The traditional anchor fragment identifier, as used in HTML, where you use an id or xml:id or other identifying mechanism to label a part of the content and it can then be addressed through a bare name fragment identifier. In this case, the representation provider can and should ensure that the representations all have the same anchors.

  2. The semantic web non-information resource hash URI, in which the fragment identifier is being used to create a unique URI that provides easy resolution to a representation that describes that URI. In this case, a representation provider is responsible for providing *descriptions* of the resource identified by the URI at the relevant place, but the representations don't contain anchors for those resources.

  3. The programmatic fragment identifier, such as the ones for images and video, XPointer, or many hash-bang URIs, which are often unconstrained. The construction of these fragment identifiers is determined by the person doing the linking rather than the representation provider. They are also much more closely bound to the media type of the representation: it doesn't make sense to select a time slice from an HTML page, or an XML element within a JPEG, and the hash-bang URI won't work without HTML+JS.

Your concern is that the second class of fragment identifiers is indistinguishable from the first (they look like anchors) and therefore the links that use them are invalid. Have I got that right?
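
To illustrate why the first two classes can't be told apart, here is a rough Python sketch that classifies a fragment identifier by syntax alone. The name classify_fragment and both patterns are hypothetical and deliberately incomplete; they only recognise a handful of programmatic prefixes.

```python
import re

# Assumed, deliberately incomplete markers of "programmatic" fragment syntax:
# hash-bang, media fragment dimensions, SVG views and XPointer schemes.
PROGRAMMATIC = re.compile(r"^(!|xywh=|t=|svgView\(|xpointer\(|element\()")

# A rough approximation of a bare name, for illustration only.
BARE_NAME = re.compile(r"^[A-Za-z_][\w.-]*$")


def classify_fragment(fragment):
    """Classify a fragment identifier using nothing but its syntax."""
    if PROGRAMMATIC.match(fragment):
        return "programmatic"
    if BARE_NAME.match(fragment):
        # Could be an HTML anchor (class 1) or a semantic web hash URI
        # (class 2); the syntax gives no way to decide between them.
        return "anchor-or-hash-uri"
    return "unknown"
```

The third class is recognisable from its syntax, but a bare name such as "foo" comes back the same whether it labels a paragraph or names a non-information resource.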

Cheers,

Jeni

[1]: http://www.w3.org/TR/media-frags/
[2]: http://www.imc.org/ietf-xml-mime/mail-archive/msg01153.html
[3]: http://www.w3.org/TR/SVG/linking.html#SVGFragmentIdentifiers
[4]: http://www.w3.org/TR/cooluris/
-- 
Jeni Tennison
http://www.jenitennison.com

Received on Tuesday, 26 April 2011 08:53:51 UTC