Re: INSEE releases OWL ontology and RDF data for geographical entities from Dan Connolly on 2006-08-04 (public-xg-geo@w3.org from August 2006)

From: Dan Connolly <connolly@w3.org>
Date: Fri, 04 Aug 2006 16:53:10 -0500
To: Eric van der Vlist <vdv@dyomedea.com>
Cc: Bernard Vatant <bernard.vatant@mondeca.com>, semantic-web@w3.org, public-xg-geo@w3.org, Franck Cotton <franck.cotton@insee.fr>
Message-Id: <1154728391.30621.67.camel@dirk.w3.org>
On Fri, 2006-08-04 at 22:43 +0200, Eric van der Vlist wrote:
> Dan,
> 
> On Friday, 04 August 2006 at 09:32 -0500, Dan Connolly wrote:
> > On Fri, 2006-08-04 at 09:28 +0200, Eric van der Vlist wrote: 
> > > Hi,
> > 
> > Hi Eric,
> > 
> > > On Thursday, 03 August 2006 at 23:26 +0200, Bernard Vatant wrote:
> > > > 
> > > > Dan
> > > > > did you consider using # rather than /? i.e.
> > > > >   http://rdf.insee.fr/geo#code_commune
> > > > > rather than
> > > > >   http://rdf.insee.fr/geo/code_commune
> > > > > especially for ontologies, it's a lot easier to manage.
> > > > >   
> > > > We did consider it. Actually my first version of the ontology used a #
> > > > namespace. Eric (in cc) was the one who suggested a / namespace,
> > > > especially for the data, and somehow convinced the rest of us. That was
> > > > six months ago, but if I remember correctly, the idea was that at some
> > > > point, each instance URI would be (should be, hopefully will be)
> > > > associated with, and give access to, a separate resource, which is not
> > > > the case now.
> > > 
> > > Yes, that was the first comment I made on your first proposal at the end of
> > > January.
> > > 
> > > The idea was that to identify a city, http://rdf.insee.fr/geo/COM_80078
> > > is better than http://rdf.insee.fr/geo#COM_80078.
> > 
> > You might also consider http://rdf.insee.fr/geo/COM_80078#city for
> > the city itself and http://rdf.insee.fr/geo/COM_80078 for a document
> > about the city.
> > 
> > If the cities come in natural chunks, perhaps
> > http://rdf.insee.fr/geo/COM_800#city78
> > for the city and http://rdf.insee.fr/geo/COM_800 for a document about
> > the cities in some region.
> 
> You mean that we should use the same URI to identify geographical
> entities and locate the fragment where they are defined?

Yes.

> We have rejected this idea for a number of reasons. I think that the
> most important of these reasons is that it would assume that the entity
> is described at only one location in only one RDF document

I don't follow you. What suggests that the entity is described
at only one location?

When using URIs of the form DOC#TERM, naturally the information
resource DOC is privileged in a way, but other documents
can say stuff about DOC#TERM too.
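
A minimal sketch of that point, in Python. The property names, the second
document URI, and the placeholder values are hypothetical and are not taken
from the INSEE data; only the COM_80078 URIs come from this thread.

```python
from collections import defaultdict
from urllib.parse import urldefrag

# Term URI of the form DOC#TERM, as suggested earlier in this thread.
CITY = "http://rdf.insee.fr/geo/COM_80078#city"

# Hypothetical statements (subject, property, value), grouped by the
# document that makes them. The values are placeholders, not real data.
statements_by_document = {
    "http://rdf.insee.fr/geo/COM_80078": [
        (CITY, "ex:name", "placeholder name"),
    ],
    "http://example.org/other-data": [
        (CITY, "ex:note", "placeholder note"),
    ],
}


def privileged_document(term_uri):
    """Return DOC for a DOC#TERM URI: the document a client would GET."""
    return urldefrag(term_uri).url


def describe(term_uri, sources):
    """Collect what every document says about term_uri, keyed by document."""
    found = defaultdict(list)
    for document, triples in sources.items():
        for subject, prop, value in triples:
            if subject == term_uri:
                found[document].append((prop, value))
    return dict(found)


if __name__ == "__main__":
    print(privileged_document(CITY))
    for document, facts in describe(CITY, statements_by_document).items():
        print(document, facts)
```

DOC is where a client looks first, but nothing stops the second document
from using the same subject URI.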

>  and that's
> not true in our case.
> 
> If you take an entity such as a city, this entity can be located over
> two higher level entities and its description is then split between the
> different higher level entities to which it belongs.
> 
> Even when a city belongs to only one higher level entity, important
> pieces of its description can be found in the description of the
> different layers of higher level entities, and the description of
> an entity such as a department is spread over four different documents.

Of course.

> We also think that splitting entities into RDF documents is a packaging
> issue that may evolve over time and shouldn't impact the URIs
> identifying the entities.

Well, I hope you'll at least consider the linked data
pattern (http://www.w3.org/DesignIssues/LinkedData ) as a
particularly useful packaging mechanism.

> Furthermore, we believe that hard coding the links between entities
> identifiers and RDF documents would make the version management of these
> documents more complex.

More complex than what?

>  We have included a year in the URIs for the RDF
> documents so that we can easily publish new versions and keep the
> previous one (an "old" version carries valid information about the
> ontology for a specific date and we think that it should remain online).
> And of course, we wouldn't like that the URIs identifying the entities
> change over time.

I would like my computers to be small, fast, and cheap, too.
But I have to choose 2. I think choosing URIs and versioning
the related representations is quite similar.

> > >  Of course, these URIs 
> > > are only identifiers but who knows, we might want some day to publish
> > > some kind of documentation (like we do in RDDL to document namespaces)
> > > at these URIs. 
> > 
> > "only identifiers"? sigh. I got the impression you wanted to publish
> > information about them in the Semantic Web.
> 
> This is semantic information that conforms to the W3C recommendations and is
> published on the World Wide Web. Isn't that sufficient to be part of the
> Semantic Web?

No.

It would be pretty boring if W3C published a specification, gave
it a URI of http://www.w3.org/TR/2006/wd-xyz/ but required
that you use some other URI to get a copy of the spec and
gave a 404 at /TR/2006/wd-xyz/ . This would conform to the
letter of the HTTP, URI, and HTML specs, but not the spirit.
If everybody who published stuff on the Internet used
different URIs for naming and location, we wouldn't have a Web
of linked resources with network effects.

The spirit of the HTTP, URI, and HTML specs, i.e. the architectural
principles that bind them together into a useful web, have now
gone thru the W3C Recommendation process too:

 "A URI owner SHOULD provide representations of the resource it
identifies"
 -- http://www.w3.org/TR/webarch/#pr-describe-resource


The 404 problems I reported show that these INSEE data
don't conform to the Web Architecture Recommendation
(unless there's some justification for the 404 errors
that I haven't seen.)


> > > If we do so, the first URI makes each city a standalone entity while the
> > > second one means that they need to be fragments in a huge document which
> > > can cause a lot of issues (we don't know which media types we might want
> > > to publish and the definition of fragments is inconsistent between media
> > > types
> > 
> > It's within your control to choose media types where the definition
> > of fragments is consistent. The easiest way is to just use one
> > media type: application/rdf+xml .
> 
> What we have in mind for these URIs isn't necessarily limited to RDF but
> could include XHTML documentation or other kinds of resources. Both RDF
> and XHTML can be published at the same location using content
> negotiation... What I meant by being inconsistent between media types is
> that if you use content negotiation you need to make sure that each
> content has the same fragments which is a further complication.

Yes, but it's not an insurmountable complication; the result is
not necessarily inconsistent.
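
One way to keep the representations in step is a publish-time check that
every negotiated variant defines the same fragment identifiers. The sketch
below assumes the fragment identifiers have already been extracted from
each variant; the function name and the sample data are hypothetical.

```python
def missing_fragments(fragments_by_media_type):
    """Map each media type to the fragment ids it lacks.

    fragments_by_media_type: dict of media type -> set of fragment ids
    defined by that representation of the same document.
    An empty result means the variants are consistent.
    """
    all_ids = set().union(*fragments_by_media_type.values())
    report = {}
    for media_type, ids in fragments_by_media_type.items():
        missing = all_ids - ids
        if missing:
            report[media_type] = sorted(missing)
    return report


if __name__ == "__main__":
    # Hypothetical variants of one document served by content negotiation.
    variants = {
        "application/rdf+xml": {"city77", "city78"},
        "application/xhtml+xml": {"city77"},
    }
    print(missing_fragments(variants))
```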


> BTW, if we ever serve RDF at these addresses, I guess that it would be
> a kind of placeholder with seeAlso attributes pointing to the different
> documents in which an entity is described, rather than the actual
> definition of the entity.

Well, I suppose that's one way to go about it. I'm curious, though:
why would you go about it that way?

> > >  (some of them don't even support fragments), the document might
> > > grow very large, ...). 
> > > 
> > > Now, the thing that we've not considered is to have a namespace URI
> > > different from the RDF base.
> > > 
> > > > Agreed, we could have kept the # namespace for the ontology at least.
> > > 
> > > Dan, can you elaborate why that makes ontologies a lot easier to manage?
> > 
> > Because with a # namespace, publishing the ontology just involves
> > sticking one static file on a web server. (the URI looks nicer
> > if the web server can handle leaving the .rdf or .owl off, but
> > that's not completely essential).
> > 
> > And then to look up http://rdf.insee.fr/geo#code_commune , a consumer
> > just GETs http://rdf.insee.fr/geo as usual; then when they want
> > to look up another term such as http://rdf.insee.fr/geo#subdivision,
> > they can save a round trip because they already have it.
> > 
> > Using a / namespace has a higher cost for the producer (redirects)
> > and for the consumer (one GET per term rather than one GET
> > for the ontology).
> 
> That's true only if you assume that these identifiers are also used as
> locations...

Yes, that's how URIs work in Web Architecture.
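
The consumer-side cost quoted above (one GET per term versus one GET for
the ontology) can be sketched as follows. This only counts distinct
documents and ignores redirects and caching; the slash form of the
subdivision term is assumed by analogy with the code_commune example.

```python
from urllib.parse import urldefrag


def documents_to_fetch(term_uris):
    """Return the distinct documents a client must GET to look up the terms.

    The fragment is never sent to the server, so all DOC#TERM URIs that
    share DOC collapse to a single request.
    """
    return {urldefrag(uri).url for uri in term_uris}


hash_terms = [
    "http://rdf.insee.fr/geo#code_commune",
    "http://rdf.insee.fr/geo#subdivision",
]
slash_terms = [
    "http://rdf.insee.fr/geo/code_commune",
    "http://rdf.insee.fr/geo/subdivision",  # assumed by analogy
]

if __name__ == "__main__":
    print(len(documents_to_fetch(hash_terms)), "GET for the # namespace")
    print(len(documents_to_fetch(slash_terms)), "GETs for the / namespace")
```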

> I know that this is a highly controversial debate,

Well, perhaps. It seems to me that the controversy is largely
over; the Web Architecture document has been all the way
thru the W3C Recommendation process. The URI spec, RFC3986,
is an IETF draft standard.

"A URI can be further classified as a locator, a name, or both."
  -- http://www.gbiv.com/protocols/uri/rfc/rfc3986.html#URLvsURN

I would like to think that these issues have been discussed
exhaustively and that the architectural principles are now
well established.

>  but I have always
> thought that the big advantage of RDF over XML vocabularies such as
> XLink is that it differentiates the two notions and I wouldn't want to
> lose this benefit!

The logical mechanisms of RDF are agnostic on the issues of best
practices for publishing data in the web. But the principles
of Web Architecture apply to URIs in RDF as well as in any
other data form.

> 
> Thanks for your clarifications!

Likewise, thanks for providing background on the INSEE data.

> Eric
> 
-- 
Dan Connolly, W3C http://www.w3.org/People/Connolly/
D3C2 887B 0F92 6005 C541  0875 0F91 96DE 6E52 C29E
Received on Friday, 4 August 2006 21:53:37 UTC