Analysis by cases of hashless URIs from Henry S. Thompson on 2011-10-30 (www-archive@w3.org from October 2011)

From: Henry S. Thompson <ht@inf.ed.ac.uk>
Date: Sun, 30 Oct 2011 21:40:19 +0000
To: www-archive@w3.org
Message-ID: <f5baa8i2m64.fsf@calexico.inf.ed.ac.uk>
Let's enumerate hash-free absolute URI usage contexts and
constituencies:

A) URIs used in actionable contexts (either user-mediated, as in <a
   href="...">, or unmediated, as in <img src="..."> or <script
   src="..."/>) in order to trigger retrieval of interpretable
   'documents', which are then interpreted according to their media
   type.  Neither requestors nor provisioners care about what they
   identify, but presumably they identify whatever the media type
   thereof says is the 'meaning' of the retrieved
   message. Provisioners almost always report 200 or 404, occasionally
   302.  The division between 'documents' with presentations,
   e.g. text/html, image/jpeg, audio/mpeg on the one hand and
   'documents' with non-directly perceivable procedural consequences,
   e.g. text/css, text/javascript on the other is unknown, but,
   particularly if we count types rather than tokens, probably heavily
   biased in favour of the first group.  The vast majority of
   retrievals are done by browsers; presumably crawlers come next.

B) URIs are used in referential contexts (RDF/XML, RDFa, Turtle, N3)
   to identify subjects, relations or objects of RDF triples.  In
   principle there could be other (non-RDF) referential contexts---we
   can for example imagine a version of KLONE which uses URIs for
   identifiers.  Retrieval is by definition speculative, in quest of
   more triples (or other context-appropriate descriptive material,
   e.g. more URI-KLONE).

   From the provisioning side, there is a moderately complicated tree
   of cases:

     1) Nothing is known: 404, no problem

     2) Thing identified is not an information resource and some
        description 'document(s)' is/are available
        a) Return a 303 plus Location: [of the description 'doc']
           (See e.g. http://dbpedia.org/resource/Albert_Einstein);

        b) Return a 200 plus (one of) the description 'doc(s)', 
           possibly including either or both of
             Content-location: [uri of IR for the description doc]
             <wdrs:describedby rdf:resource=' " '/>
           (See e.g. http://iandavis.com/2010/303/toucan,
           http://schema.org/Thing,
           http://www.uk-postcodes.com/postcode/EH125BB [Accept:
           a/rdf+xml],
           http://lsid.tdwg.org/urn:lsid:ubio.org:namebank:11815);

           i) Return a 302 plus Location: [of a description 'doc']
              (See (!) e.g. http://purl.org/dc/elements/1.1/identifier);

     3) Thing identified is an information resource but only a/some
        description 'document(s)' is/are available
         (No known examples, but imagine e.g. some RDF about the 2020
          US census report)
        Same alternatives as (B2a) and (B2b)

     4) Thing identified is an information resource _and_ a/some
        description 'document(s)' is/are available.  200 + a
        representation is the only possible result.  The description
        may be embedded in the representation as RDFa if the
        representation is XML or HTML (see
        e.g. http://sercompetitivos.com/?ibsa=share&id=1590,
        http://www.somebits.com/weblog/culture/blogs/ccLicense.html),
        or synthesised from one or more <link rel=...> or <meta
        rel=...> elements in the <head> (see
        e.g. http://en.wikipedia.org/wiki/Organelle,
        http://lod.geospecies.org/bioclasses/aQado.xhtml), if the
        representation is HTML*.  Regardless of media type other
        discovery mechanisms have been canvassed, including Link:
        headers, .well-known provision, etc., but I'm not aware of
        _any_ examples of these in use.

        How well any given tool which gets such a response does at
        tracking down/locating the description varies widely, I expect.

     5) Thing described is an information resource, a/some description
        'document(s)' is/are available, but they describe a landing
        page URI, not the URI for the resource itself.  Both the
        landing page and the resource are served with 200 + a
        representation, which in all examples I'm aware of carries the
        description embedded as RDFa (see
        e.g. http://www.flickr.com/photos/62234213@N00/354736733, 
        [couldn't find a journal article landing page example])

        Not aware of any tool which is capable of sorting out the
        confusion here.
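
As a rough illustration of how little the status line alone tells a
client, here is a minimal sketch that maps a response onto the cases
above.  The function name and the returned labels are my own
shorthand, not anything a real tool implements, and treating
Content-Location as a hint towards B2b is an assumption (a
content-negotiated B4 response could carry it too).

```python
from typing import Mapping, Optional


def classify_response(status: int,
                      headers: Optional[Mapping[str, str]] = None) -> str:
    """Guess which provisioning case (B1..B5) a response falls under.

    Hypothetical helper: the labels are shorthand for the cases in the
    text, and the result is only as good as the status line allows.
    """
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in (headers or {}).items()}
    has_location = "location" in lowered

    if status == 404:
        return "B1"
    if status == 303 and has_location:
        # B3 offers the same alternatives as B2a, so the two are not
        # distinguishable from the response alone.
        return "B2a or B3"
    if status == 302 and has_location:
        return "B2b(i)"
    if status == 200:
        # Assumption: Content-Location hints at a separate description doc.
        if "content-location" in lowered:
            return "B2b or B3 (Content-Location hint)"
        # Otherwise the client has to look inside the representation.
        return "B2b, B4 or B5"
    return "not enumerated"


if __name__ == "__main__":
    # example.org URIs are placeholders, not taken from the cases above.
    print(classify_response(303, {"Location": "http://example.org/doc"}))
    print(classify_response(200))
```

Note that every 200 without further hints lands in the same bucket,
which is the B4/B5 difficulty in miniature.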

A of course hugely dominates B numerically.  Within B, obviously a lot
of B5 because of flickr.  B2 is LOD heartland, B3 doesn't raise any
issues that B2 doesn't.  The majority of the B4 cases are harmless,
because the representation is HTML, the resource is an
ordinary-language:document and there's no other referent in the
picture.  The subcase where the representation is RDF (or N3 or
Turtle...) and there _are_ two resources in play is rare (?).

Hmm.  Sindice finds 300K RDF pages with cc:license statements.

First one is 

 http://carpictures.cc/cars/photo/car_picture/13037/grey_mercedes_rear_license_plate_blank.rdf

which actually illustrates the landing page problem, not the pun
problem.  That URI yields an HTML page with a picture of a
merc. embedded in it and

<link rel="alternate" type="application/rdf+xml" title="RDF/XML
Representation"
href="http://carpictures.cc/cars/photo/car_picture/13037/grey_mercedes_rear_license_plate_blank.rdf"
/>

The RDF itself includes

<http://carpictures.cc/cars/photo/car_picture/13037/grey_mercedes_rear_license_plate_blank>
<http://creativecommons.org/ns#license>
<http://creativecommons.org/licenses/by/2.0>

And there is also

 http://rdf.ecs.soton.ac.uk/degree/csInt

which illustrates a careful pattern -- that document contains
assertions about both

 itself, that is http://rdf.ecs.soton.ac.uk/degree/csInt

including a cc:license and rdf:type Ontology

as well as assertions about

 what it denotes, that is http://id.ecs.soton.ac.uk/degree/csInt

including rdf:type ...:...Degree and ...:hasCohort

And an interesting 3rd-party pun mistake turns up in

 http://purl.org/derecho

where we find

 <http://dbpedia.org/resource/Law>
   <http://creativecommons.org/ns#license>
 <http://creativecommons.org/licenses/by-nc-sa/3.0/es>

Which is wrong -- that's the concept -- the predicate is only true of
http://dbpedia.org/page/Law and http://dbpedia.org/data/Law

And here's a case where the pun is OK!  That is, the predication is
true on _either_ reading.  In

 http://purl.org/NET/cidoc-crm/core

we have
 
 <http://purl.org/NET/cidoc-crm/core>
  <http://creativecommons.org/ns#license>
 <http://creativecommons.org/licenses/by/3.0/>

_and_

 <http://purl.org/NET/cidoc-crm/core>
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
 <http://www.w3.org/2002/07/owl#Ontology>

But it's OK to have the license apply either to the document or to the
ontology it describes.

But finally we also have

 http://squio.nl/blog/triplify/user/1

which contains

 <http://squio.nl/blog/triplify/user/1>
  <http://creativecommons.org/ns#license>
 <http://creativecommons.org/licenses/by/3.0/us/>

as well as

 <http://squio.nl/blog/triplify/user/1>
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
 <http://xmlns.com/foaf/0.1/Person>

Bingo.

But I had to look pretty hard to find that (about 90 minutes looking
at Sindice results).
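
The kind of check I was doing by eye could be sketched roughly as
follows: collect subjects that carry a cc:license and are also typed
as something that cannot be a document.  This is only a sketch under
stated assumptions: triples are plain (subject, predicate, object)
string tuples, find_pun_candidates is a made-up name, and the set of
"non-document" classes (here just foaf:Person) is my own choice.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]

CC_LICENSE = "http://creativecommons.org/ns#license"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumption: classes whose instances cannot themselves be licensed documents.
NON_DOCUMENT_CLASSES = frozenset({"http://xmlns.com/foaf/0.1/Person"})


def find_pun_candidates(triples: Iterable[Triple],
                        non_document_classes=NON_DOCUMENT_CLASSES) -> Set[str]:
    """Return subjects that are both licensed and typed as a non-document."""
    licensed: Set[str] = set()
    types: Dict[str, Set[str]] = defaultdict(set)
    for subject, predicate, obj in triples:
        if predicate == CC_LICENSE:
            licensed.add(subject)
        elif predicate == RDF_TYPE:
            types[subject].add(obj)
    blocked = set(non_document_classes)
    return {s for s in licensed if types.get(s, set()) & blocked}


if __name__ == "__main__":
    triples = [
        ("http://squio.nl/blog/triplify/user/1", CC_LICENSE,
         "http://creativecommons.org/licenses/by/3.0/us/"),
        ("http://squio.nl/blog/triplify/user/1", RDF_TYPE,
         "http://xmlns.com/foaf/0.1/Person"),
        ("http://purl.org/NET/cidoc-crm/core", CC_LICENSE,
         "http://creativecommons.org/licenses/by/3.0/"),
        ("http://purl.org/NET/cidoc-crm/core", RDF_TYPE,
         "http://www.w3.org/2002/07/owl#Ontology"),
    ]
    print(find_pun_candidates(triples))
```

With the two examples above as input, only the squio.nl subject is
flagged; the cidoc-crm one passes because owl:Ontology is not in the
non-document set, which matches the "pun is OK" reading.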

Sigh, this is much too long and diffuse, but it's a record of how I
spent several hours a day for the last three days.  More thoughts
about all this when I can. . .

ht

* Searching for "cell organelles" with the "free to use" box ticked in
  Google advanced search, the hits break down about 60-40 between ones
  with <a href="[cc license]"> somewhere in the HTML and ones with
  <link rel="license|copyright" href="...cc..."/>.  The former don't
  count as far as I'm concerned, since none of them are recognisable
  as RDFa (they lack rel="(cc:)license").

  There are relatively few <a rel="cc:license"> on any pages -- about
  46K according to Sindice, somewhat fewer rel="license" -- about 17K.
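
  The distinction drawn in this footnote can be sketched with the
  Python standard library's html.parser.  This is an illustration
  only: the class name is made up, the set of rel values follows the
  ones mentioned above, and matching "creativecommons.org/licenses/"
  in the href is my assumption for what counts as a bare cc link.

```python
from html.parser import HTMLParser

# rel values treated as license markup, per the footnote above.
LICENSE_RELS = {"license", "cc:license", "copyright"}


class LicenseLinkFinder(HTMLParser):
    """Separate rel-marked license links from bare <a href> cc links."""

    def __init__(self):
        super().__init__()
        self.marked = []    # (tag, href) pairs carrying a license rel
        self.unmarked = []  # <a> links to a cc license with no such rel

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags such as <link .../>.
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        href = attr.get("href") or ""
        rels = (attr.get("rel") or "").lower().split()
        if any(rel in LICENSE_RELS for rel in rels):
            self.marked.append((tag, href))
        elif tag == "a" and "creativecommons.org/licenses/" in href:
            # Assumption: a cc licence URL in href with no rel is "bare".
            self.unmarked.append((tag, href))


if __name__ == "__main__":
    finder = LicenseLinkFinder()
    finder.feed('<a href="http://creativecommons.org/licenses/by/2.0/">CC</a>'
                '<a rel="cc:license" '
                'href="http://creativecommons.org/licenses/by/3.0/">CC</a>')
    print(finder.marked, finder.unmarked)
```

  Only the entries in marked would count for my purposes; the ones in
  unmarked are the 60% that don't.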
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: ht@inf.ed.ac.uk
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]
Received on Sunday, 30 October 2011 21:40:49 UTC