RE: Comments on "Designing URI Sets for the UK Public Sector" from Young,Jeff (OR) on 2010-11-13 (public-lld@w3.org from November 2010)

From: Young,Jeff (OR) <jyoung@oclc.org>
Date: Fri, 12 Nov 2010 19:01:47 -0500
To: "William Waites" <ww@eris.okfn.org>
Cc: "public-lld" <public-lld@w3.org>, <richard.stirling@cabinet-office.x.gs.gov.uk>, <john.sheridan@nationalarchives.gsi.gov.uk>, <jeni@jenitennison.com>
Message-ID: <52E301F960B30049ADEFBCCF1CCAEF590A62F9BD@OAEXCH4SERVER.oa.oclc.org>
I appreciate Jodi's comment that my document was actually "readable".
Unfortunately, my clarity seems to be running on empty again. :-)

William, my comments are below:

> > (Real world Document)
> >
> > http://education.data.gov.uk/school/78
> >
> > (Generic) Document
> >
> > http://education.data.gov.uk/school/78/
> 
> I don't quite understand what a "Generic Document" is and the
> difference between presence and absence of a slash is very slight
> and likely to lead to confusion and bugs for people using the
> data.

Note that the CTOC document specifies a distinction between "Document"
and "Representation" resources. In the "Cool URIs for the Semantic Web"
document, the terms "Generic Document" and "Web Document" are used
instead:

http://www.w3.org/TR/cooluris/#r303gendocument
http://www.w3.org/TR/cooluris/#oldweb 

The meaning of "Generic Document" is based on a W3C TAG decision known
as genericResources-53 described here:

http://www.w3.org/2001/tag/issues.html#genericResources-53
http://www.w3.org/2001/tag/doc/alternatives-discovery

Basically, a Generic Document is an identifiable content-negotiable Web
Document that has no representation of its own. (You could think of the
trailing slash as an indication of this fact.) This abstraction allows
servers to support browsers (HTML) and semantic agents (RDF) from the
same HTTP URI. There is a one-to-many relationship between a generic
document and its web documents, which fits naturally with the pattern I've
suggested:

Generic Document: http://education.data.gov.uk/school/78/
Web Document: http://education.data.gov.uk/school/78/about.html (application/xhtml+xml)
Web Document: http://education.data.gov.uk/school/78/about.rdf (application/rdf+xml)
Web Document: http://education.data.gov.uk/school/78/about.json (application/json)
etc.
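
As a rough illustration of the one-to-many relationship above, here is a minimal Python sketch of how a server might pick a Web Document for a Generic Document URI from an Accept header. The names REPRESENTATIONS, parse_accept and negotiate are hypothetical, the table simply mirrors the about.* names listed above, and wildcards such as */* are deliberately not handled.

```python
# Hypothetical table: media type -> Web Document name under a Generic
# Document URI. The names mirror the about.* examples in this message.
REPRESENTATIONS = {
    "application/xhtml+xml": "about.html",
    "application/rdf+xml": "about.rdf",
    "application/json": "about.json",
}


def parse_accept(header):
    """Return acceptable media types from an Accept header, best q first."""
    ranked = []
    for position, item in enumerate(header.split(",")):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        if q > 0:  # q=0 means "not acceptable"
            ranked.append((-q, position, media_type))
    return [media_type for _, _, media_type in sorted(ranked)]


def negotiate(generic_uri, accept_header):
    """Pick the Web Document URI for a Generic Document URI, or None."""
    if not generic_uri.endswith("/"):
        raise ValueError("in this pattern a Generic Document URI ends with '/'")
    for media_type in parse_accept(accept_header):
        if media_type in REPRESENTATIONS:
            return generic_uri + REPRESENTATIONS[media_type]
    return None  # a real server would likely answer 406 Not Acceptable here
```

A browser and an RDF agent can then share http://education.data.gov.uk/school/78/ and be handed about.html and about.rdf respectively.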

> > (Web Document) Representation
> >
> > http://education.data.gov.uk/school/78/doc.rdf
> 
> Why not just 78.rdf, 78.html, etc?

There are a few reasons. The most important is to accommodate use cases
where multiple representations of the same "file extension" are
possible. For example:

Generic Document: http://viaf.org/viaf/108389263/
Web Document: http://viaf.org/viaf/108389263/viaf.html 
Web Document: http://viaf.org/viaf/108389263/viaf.xml 
Web Document: http://viaf.org/viaf/108389263/marc21.html  
Web Document: http://viaf.org/viaf/108389263/marc21.xml 
Web Document: http://viaf.org/viaf/108389263/unimarc.html 
Web Document: http://viaf.org/viaf/108389263/unimarc.xml

A more common example would be the potential for desktop and mobile
versions of HTML, both negotiable from the same Generic Document URI:

Generic Document: http://example.org/person/alice/ 
Web Document: http://example.org/person/alice/default.html 
Web Document: http://example.org/person/alice/mobile.html 

> > Definition of the scheme concept
> >
> > http://education.data.gov.uk/ontology/education/#School
> 
> The URI looks very strange. Obviously it is valid to have a
> # immediately following a / but it still looks very strange.

It's the functionality that's important here. As above, the trailing
slash indicates a Generic Document. This means the URI can be used to
support delivery of HTML and RDF representations:

Generic Document: http://education.data.gov.uk/ontology/education/
Web Document: http://education.data.gov.uk/ontology/education/about.html
Web Document: http://education.data.gov.uk/ontology/education/about.owl

In this pattern, the generic resource is intended to be an OWL ontology.
By hanging the '#' off the generic document, the URI "works" as an
owl:Class identifier AND as an HTML anchor. This allows the OWL URIs to
be self-documenting. Here are two live examples from VIAF:

<http://viaf.org/ontology/1.1/#Heading> a owl:Class .
<http://viaf.org/ontology/1.1/#hasEstablishedForm> a owl:ObjectProperty .
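
To see why hanging the '#' off the generic document works, recall that a client strips the fragment before making the HTTP request. A small Python sketch using the standard library's urldefrag shows which URI is actually dereferenced for the first VIAF class above (the variable names are only illustrative):

```python
from urllib.parse import urldefrag

class_uri = "http://viaf.org/ontology/1.1/#Heading"

# The fragment never reaches the server: the client requests the part
# before '#', which in this pattern is the Generic Document.
document_uri, fragment = urldefrag(class_uri)

print(document_uri)  # the URI that is actually requested
print(fragment)      # resolved by the client inside the returned document
```

In an HTML representation the fragment lands on an anchor, and in an RDF representation the full URI names the owl:Class.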

> And I don't see why ontology/education wouldn't be the
> name of the ontology, with education.html being the human
> readable documentation, and education.rdf being the machine
> readable, and education#School being an identified fragment
> in those docs.

I want to keep the URI patterns and features for the model and
meta-model as similar as possible. "Ontology" should be treated as a
class just like "Road" with multiple content-negotiable representations.
There is a subtle difference that needs to be acknowledged, though.
Unlike an individual "Road", every individual owl:Ontology IS a Web
Document. If this is your perspective, then it would make sense to have
the "real world" URI return 302 (Found) instead of 303 (See Other). IMO,
a "real world" resource interpretation COULD be justified if you're like
me and believe that UML and OWL are different ways to represent an
abstract "model".
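
A minimal Python sketch of the 302 versus 303 choice described above. It assumes (my assumption, not something the CTOC document states) that the slash-less "real world" URI redirects to the Generic Document formed by appending a slash; redirect_for and is_web_document are hypothetical names.

```python
def redirect_for(real_world_uri, is_web_document):
    """Return (status, Location) for a slash-less "real world" URI.

    Assumption: the redirect target is the Generic Document, i.e. the
    same URI with a trailing slash appended.
    """
    if real_world_uri.endswith("/"):
        raise ValueError("expected the slash-less 'real world' URI")
    # 303 (See Other) for things that are not documents, such as a road;
    # 302 (Found) if the resource is itself treated as a Web Document,
    # as one might argue for an owl:Ontology.
    status = 302 if is_web_document else 303
    return status, real_world_uri + "/"
```

For example, redirect_for("http://education.data.gov.uk/school/78", False) yields a 303 to the school's Generic Document.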

BTW, the CTOC document talks about "URI sets" "for 'Things' such as
schools, roads, legislation, locations, projects, events, and so on."
IMO, these should be modeled as OWL classes. The CTOC document also
talks about an "organization of URI sets into 'sectors (e.g. education,
transport, health)..." I think these should be modeled as ontologies.

I mentioned some reservations about education.data.gov.uk as a domain
name, and the token redundancy in examples like
http://education.data.gov.uk/ontology/education/ starts to illustrate
why. Here are some URIs refactored based on my interpretation of sectors
as ontologies:

<http://data.gov.uk/ontology/education/> a owl:Ontology .
<http://data.gov.uk/ontology/education/#School> a owl:Class .
<http://data.gov.uk/school/123> a
<http://data.gov.uk/ontology/education/#School> .

<http://data.gov.uk/ontology/transport/> a owl:Ontology .
<http://data.gov.uk/ontology/transport/#Road> a owl:Class .
<http://data.gov.uk/road/M5> a
<http://data.gov.uk/ontology/transport/#Road> .

etc.

If the UK is anything like OCLC, information and management of
individuals in such classes is scattered far and wide and it's too
idealistic to think they need to be identified and described in a new
and improved silo. Maybe someday. If the scattered systems all use the
same ontologies, though, this distributed information can be published
with some consistency and systematically linked instead. This may be
happening behind the scenes, but I'd be surprised.

> > List of scheme identifiers
> >
> > http://education.data.gov.uk/school/
> >
> > Set
> >
> > http://education.data.gov.uk/school
> 
> Again, this is a very big semantic difference for the presence
> or absence of a / to signal.

Browser users will never notice. They're used to web servers adding a
slash.

There are two resources related 1-to-1 here and neither one can be
avoided in Linked Data. If the difference with or without the slash is
too subtle, then I would argue that shunting them off to a top-level
path segment named "set" is too heavy handed. This gets back to my
concern about the idealism of a new and improved silo.

> The way most people would
> understand a trailing / is that it implies the string "index".
> I realise this isn't RDF semantics but it is the behaviour
> that everyone who has ever done any web development will
> expect.

I think there is good precedent for delivering default.html or
index.html depending on the circumstances. I haven't looked lately, but
Tomcat and Apache used to come preconfigured for both. (I don't mean to
suggest that Linked Data boils down to an Apache Web Server, of course.)

My pattern has two Generic Resources that both end with a trailing
slash. It would be sensible to configure index.html as a default HTML
representation for the "Set" and default.html as the default HTML
representation for a "Document".

> So why not school/schemes and school/all or something?
> (along with schemes.html, schemes.rdf, all.html, all.rdf etc)?

We need to be careful about semantic collisions in the URI hierarchy.
The CTOC and my patterns are consistent in the {concept}/{reference}
pattern (what I would call {class-name}/{instance-name}). Reserving
tokens from the set of possible instance tokens is problematic, which is
why the CTOC and my patterns avoid it.
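
The collision can be made concrete with a short Python sketch. It assumes a strict two-segment {class-name}/{instance-name} path; split_class_instance is a hypothetical helper, and "all" is the reserved token proposed in the quoted question.

```python
from urllib.parse import urlsplit


def split_class_instance(uri):
    """Split a {class-name}/{instance-name} URI into its two tokens."""
    segments = urlsplit(uri).path.lstrip("/").split("/")
    if len(segments) != 2 or "" in segments:
        raise ValueError("expected exactly {class-name}/{instance-name}")
    return segments[0], segments[1]


# An ordinary instance and a "reserved" token parse identically, so the
# pattern alone cannot tell whether "all" is a listing or a school that
# happens to be named "all".
school = split_class_instance("http://education.data.gov.uk/school/78")
listing = split_class_instance("http://education.data.gov.uk/school/all")
```

Both calls return a (class, instance) pair, which is exactly why reserving tokens in the instance position is risky.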

> > Also note that their URI pattern recognition for "(Web Document)
> > Representation" depends on the trailing path segment starting with
> > the letters "doc.". This is a serious limitation, IMO, caused by
> > their willingness to stack concept/reference pairs in their URI.
> > This limitation could be avoided by coining a formulaic or opaque
> > token for the individual instead. (Roads and junctions have a nasty
> > habit of changing "names" over time, so maybe opaque tokens would be
> > better in these cases.)
> 
> This is clearly not a problem that is unique to DGU. The
> problem with opaque identifiers is they don't make sense to
> humans.
> 
> 	http://ckan.net/package/statistics-data-gov-uk
> 
> is a lot better than
> 
> 	http://ckan.net/package/b37a8465-e94f-4c84-95b9-dc3c2b2e1066
> 
> but the former as you rightly point out may change.

I'm arguing for a URI pattern, so I think that transparent vs. opaque
needs to be considered class-by-class in the context of use cases. I
like transparent tokens if they can be mapped to an ontology or some
other explainable principle. UUIDs are OK if you have to pull an
instance identifier out of thin air, but in general I would prefer
sequential numbers if nothing better will do. People often need to type
these things in by hand.

> > Their stacked (Real world) Identifier:
> > http://education.data.gov.uk/id/road/M5/junction/24
> >
> > Formulaic alternative: http://education.data.gov.uk/junction/M5-24
> 
> (s/education/transport/)
> 
> I agree your alternative here is more succinct and better for that
> reason, but I'm not sure it solves the opaque and unreadable vs.
> plastic and memorable problem.

I think hackability is important, which is why I think it's important
for these "URI Types" to be structured hierarchically based on generalized
principles rather than scattered across different top-level path
segments. I wouldn't be offended if somebody stepped in and overrode a
token here or there, but IMO this should be the exception rather than
the rule. We have too much real work to do. ;-)

Jeff

> As I said, I think our approaches are very similar (modulo the
> bit about the trailing /).
> 
> Cheers,
> -w
Received on Saturday, 13 November 2010 00:08:23 UTC