
Re: Facebook Linked Data

From: Sebastian Schaffert <sebastian.schaffert@salzburgresearch.at>
Date: Mon, 3 Oct 2011 17:59:58 +0200
Cc: Linking Open Data <public-lod@w3.org>, "semantic-web@w3.org >> semantic-web@w3.org" <semantic-web@w3.org>
Message-Id: <3B5F20E9-609D-4E07-BA5C-E46031DF8C42@salzburgresearch.at>
To: Norman Gray <norman@astro.gla.ac.uk>
Dear Norman,

Sorry for replying late, I was a bit busy with other things ...

Am 28.09.2011 um 19:13 schrieb Norman Gray:

> Sebastian, hello.
> On 27 Sep 2011, at 13:43, Sebastian Schaffert wrote:
>>> I think you're disappointed because your expectations may be wrong.
>> My expectations are my expectations. But I accept that the world maybe does not satisfy them ;-)
> I often have the same feeling -- *sigh* -- I've come to think of it as the tragedy of adulthood....
>> But from my experience in developing software together with industry partners out there I have a good guess that my expectations will more-or-less match with the expectations of other developers. Especially those who are not very deep in Semantic Web technologies. 
> I'm nervous of opening up a potentially long discussion, but I've never understood what's so hard about httpRange-14.  Any time I've explained it to someone -- including some pretty SemWeb-sceptical RDBMS people -- they've got the idea and its importance pretty promptly.  I may have given one RDBMS colleague their SemWeb insight that way.
> I do appreciate that in certain circumstances, where one doesn't have good control over the data being LODified, there's no option but to say, in effect
>    <http://example.org/foo>
>        a foaf:Person;
>        a foaf:Document.
> (I haven't looked at it, but I imagine that dbpedia either suffers from this or else has had to be very clever with domains to get round it).
> According to httpRange-14, of course, one of those statements must simply be false.  So clients have to be smart to deal with this punning; but life is hard and we know this is the wild wild web: the httpRange-14 dogma cannot be absolute.

In practice I would even argue that the "inconsistency" in this data is rarely a problem, because applications will simply ignore the information that is irrelevant to them. Inconsistent information - from my perspective - is only a problem when a certain kind of reasoning is applied that specifically takes both facts into account, and thus the "ex falso quodlibet" principle of logic strikes. On the WWW - as you say - we will have to live with inconsistencies anyway. So better to welcome them and build applications that do not easily propagate errors in the data :)

My argument here is also that there is not really a URI identity crisis, except if you make the "mistake" of having both a document and a concept behind "http://example.org/foo". DBpedia and other Linked Data servers have an IMHO clean approach to this problem:
- if you request http://example.org/foo as text/html, you are redirected to http://example.org/page/foo, which is the actual document containing a human-readable description of http://example.org/foo
- if you request http://example.org/foo as RDF data, you are redirected to http://example.org/data/foo, which is the actual document containing the machine-readable description of http://example.org/foo

Now if you want to speak about the human-readable document or the RDF document, you can easily do so by using the respective URIs. The connection between the documents and the concepts is modelled using the HTTP redirect and is thus clear to the client. From my perspective, this is a much cleaner and more human-friendly approach to the problem than httpRange-14.
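To make the mechanism concrete, here is a minimal sketch (hypothetical names, Python standard library only) of the server-side decision described above: the concept URI itself is never served as a document; instead, the Accept header determines whether the client is redirected to the /page/ or the /data/ document.

```python
# Hypothetical sketch of the DBpedia-style redirect scheme described
# above: a request for the concept URI http://example.org/foo is
# redirected either to the human-readable page or to the
# machine-readable data document, depending on the Accept header.
# The path layout (/page/, /data/) follows the example in the text.

def redirect_target(concept_uri: str, accept: str) -> str:
    """Map a concept URI plus an Accept header to the document URI
    that the server would redirect to."""
    prefix, slug = concept_uri.rsplit("/", 1)
    if "application/rdf+xml" in accept or "text/turtle" in accept:
        # machine-readable description
        return f"{prefix}/data/{slug}"
    # human-readable description (the default)
    return f"{prefix}/page/{slug}"

print(redirect_target("http://example.org/foo", "text/html"))
# http://example.org/page/foo
print(redirect_target("http://example.org/foo", "application/rdf+xml"))
# http://example.org/data/foo
```

Note that this only sketches the routing decision; a real server would answer with an HTTP redirect status and a Location header rather than computing a string.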

In our Linked Media Framework, we go a step further by also taking the MIME type into account. This results in redirects like
- http://example.org/foo, Accept: text/html; rel=content -> http://example.org/content/text/html/foo
- http://example.org/foo, Accept: image/jpeg; rel=content -> http://example.org/content/image/jpeg/foo
- http://example.org/foo, Accept: application/rdf+xml; rel=meta -> http://example.org/meta/application/rdf+xml/foo
- http://example.org/foo, Accept: text/html; rel=meta -> http://example.org/meta/text/html/foo
In this case, all four documents are different descriptions of the person http://example.org/foo (e.g. a text, an image, an RDF document, and tabular metadata in HTML).
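The redirect rule behind the four examples above can be sketched in a few lines. This is a simplified illustration (hypothetical function name, Python standard library only), not the actual Linked Media Framework implementation: the target path is simply composed from the rel parameter ("content" vs. "meta") and the requested MIME type.

```python
# Simplified sketch of the Linked Media Framework redirect pattern
# shown above: the redirect target is built from the rel parameter
# and the MIME type of the Accept header. The real framework does
# more (content negotiation, 303 responses, etc.); this only mirrors
# the URI pattern from the four examples in the text.

def lmf_redirect(uri: str, accept: str, rel: str) -> str:
    """Compose the redirect target for a resource URI, an Accept
    MIME type, and a rel parameter ("content" or "meta")."""
    prefix, slug = uri.rsplit("/", 1)
    return f"{prefix}/{rel}/{accept}/{slug}"

print(lmf_redirect("http://example.org/foo", "image/jpeg", "content"))
# http://example.org/content/image/jpeg/foo
print(lmf_redirect("http://example.org/foo", "application/rdf+xml", "meta"))
# http://example.org/meta/application/rdf+xml/foo
```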

Btw, the above snippet is not inconsistent in itself. It would be if we said that foaf:Person and foaf:Document are disjoint and applied some sort of advanced semantics (i.e. OWL, not RDF/RDFS) to it - something we do implicitly because we think the distinction is reasonable, but it is not explicitly stated.

>>> When you dereference the URL for a person (such as .../561666514#), you get back RDF.  Our _expectation_, of course, is that that RDF will include some remarks about that person (.../561666514#), but there can be no guarantee of this, and no guarantee that it won't include more information than you asked for.  All you can reliably expect is that _something_ will come back, which the service believes to be true and hopes will be useful.  You add this to your knowledge of the world, and move on.
>> There I have my main problem. If I ask for "A", I am not really interested in "B".
> But if one does accept the logic of httpRange-14, then 'A' is something like 'B#', and it is _impossible_, as a consequence of the way HTTP is defined, to dereference specifically 'A', and thus any client which exists in a world with httpRange-14 in it, must necessarily be able to deal with the fact that what is described in the response may not be precisely what it did the HTTP transaction on.
> It presumably knows that it was asked to find out about 'A' = 'B#', so it can do its filtering process with that in mind, no?

In the case that I tried I was asking for .../sebastian.schaffert and I got back .../561666514#. There was no redirect and no information how sebastian.schaffert is related to 561666514#.

If I had requested .../561666514 and I got .../561666514#, the situation would have been a bit simpler, but there are still a lot of open points.

One of the most important ones is that the "#" character in URIs itself is interpreted differently depending on the syntax used and on the client that uses it:
- in HTML, it refers to the HTML anchor in the page, identified by the <a name="..."> tag
- in XML, it refers to the XML id of an element in the page, in a way that is often incompatible with RDF/XML (just imagine an "id" attribute on an RDF/XML property...)
- Web browsers often use the "#" on the client side to represent stateful information in Javascript
- with multimedia files, the "#" often identifies a fragment of the multimedia file (see the work of the Media Fragments Working Group: http://www.w3.org/2008/WebVideo/Fragments/WD-media-fragments-spec)

So the semantics of the "#" itself are not very well defined and are left to the browser, and in most cases it is actually used to identify a *fragment* and not a different thing. Using it to distinguish between document and object is in my opinion an abuse of the original specification.
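There is also a purely mechanical reason why hash URIs behave differently: per the URI syntax, the fragment is handled client-side and is never sent in the HTTP request, so the server cannot even tell a request for .../foo apart from one for .../foo#me. A quick illustration with the Python standard library:

```python
# The fragment part of a URI is stripped by the client and never
# transmitted to the server; this is why a server cannot respond
# differently for .../foo and .../foo#me.

from urllib.parse import urlsplit, urlunsplit

parts = urlsplit("http://example.org/foo#me")
print(parts.fragment)  # me  (kept by the client only)

# What actually goes on the wire is the URI without the fragment:
wire_uri = urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))
print(wire_uri)  # http://example.org/foo
```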

> I agree, by the way, that we shouldn't expect that everyone in the world, including RDBMS diehards and junior programmers, should be expected to understand the formal subtleties here.  But that's what libraries and layers are for, surely.  You do understand the difference, so you write a thin layer which provides API-users with the information they expect.  Am I misunderstanding a constraint?

The argument about libraries does not really hold, because in the end developers have to understand the underlying model to be able to use the library. No one can use a relational database without knowing the basic concepts of tables, and only with sufficient knowledge of the relational model are developers able to use relational databases correctly. Sophisticated libraries like Hibernate only make things easier for people who already understand the model.

>> - the data I get back is not about the resource I requested (discussion above), because there are competing philosophies about httpRange-14 (which is IMHO a never ending problem, unsolvable and also unnecessary in most situations), because there are several different recommendations about how to publish data on the web, or because some service somehow decides that some other data might be more useful or interesting than the one I asked for
> I'm prolonging this discussion because I'm trying to publish linked data myself (I just need to twist a few more SemWeb-sceptical arms), I believe I thoroughly understand the pattern and the point, and so I would be interested to find out if I've somehow drifted away from the mainstream.

Not necessarily from the mainstream, which seems to have mostly adopted httpRange-14. But there are also some people criticising this decision; see e.g. the good summary at:
- http://dfdf.inesc-id.pt/tr/web-arch

| Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
| Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
| Head of Knowledge and Media Technologies Group          +43 662 2288 423
| Jakob-Haringer Strasse 5/II
| A-5020 Salzburg
Received on Monday, 3 October 2011 16:00:35 UTC
