Re: URI policy for thesaurus concepts

* Thomas Bandholtz <thomas@bandholtz.info> [2004-04-30 22:01+0200]
> 
> 
> Alistair > One thing a GET request for the thesaurus URI should definitely
> > return is a description of that thesaurus (i.e. name, version, creators,
> > description of scope and content etc.) although again whether that should
> > be machine or human readable is open.
> 
> Obviously *both* must be possible. We had megabytes of discussion on this
> in the Topic Map community and elsewhere.

Sure, both are possible and will remain so. Our focus here, though, is
on the machine readable aspect (and on making machine interfaces that 
are adequate to support an interesting and usable range of human
interfaces).

> The answer is very simple - there are humans, and there are machines
> (software agents).
> A service may decide to serve only one of them, but she may decide to serve
> both.
> She may even decide to serve several machine protocols or several human
> readable layouts.
> The consequence is that a single URL is not enough.

That doesn't necessarily follow. HTTP supports content negotiation
(see http://www.w3.org/Protocols/ and RFC 2616,
ftp://ftp.isi.edu/in-notes/rfc2616.txt), which allows multiple
representations of the same thing to be made accessible via a common URI.
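
For example (a rough Python sketch; the URI and media types below are just
placeholders, not settled choices), a client can ask one and the same URI for
different representations simply by varying the Accept header:

    import urllib.request

    # Placeholder thesaurus URI; assumes the server supports content negotiation.
    uri = "http://example.org/thesaurus"

    # Same URI, two Accept headers: one machine readable, one human readable.
    for accept in ("application/rdf+xml", "text/html"):
        req = urllib.request.Request(uri, headers={"Accept": accept})
        with urllib.request.urlopen(req) as resp:
            print(accept, "->", resp.headers.get("Content-Type"))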

> We need pairs of protocol-URL such as
> 
> HTML -> http://human.blah.org/thesaurus.html
> WSDL -> http://services.blah.org/thesaurus.wsdl
> DCMI -> http://dcmi.blah.org/thesaurus.xml
> etc., etc.,
> 
> these must be explicitly *pairs* (protocol -> URL) as the domain name must
> not contain any significant meaning itself (see RFC URI)

We can also use XML namespace mixing to make multiple kinds of
information available within a common piece of markup (eg. RDF inside
XHTML, or RDF styled into XHTML using XSLT). There are a lot of options
to explore.
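
As a rough sketch (the page and properties below are invented for
illustration only), an XHTML page can carry an RDF island that a
namespace-aware parser picks out, while a browser simply renders the
surrounding markup:

    import xml.etree.ElementTree as ET

    # Hypothetical XHTML page carrying an embedded RDF description.
    doc = """<html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          xmlns:dc="http://purl.org/dc/elements/1.1/">
      <head><title>Example thesaurus</title></head>
      <body>
        <p>Human-readable description goes here.</p>
        <rdf:RDF>
          <rdf:Description rdf:about="http://example.org/thesaurus">
            <dc:title>Example thesaurus</dc:title>
          </rdf:Description>
        </rdf:RDF>
      </body>
    </html>"""

    RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
    DC = "{http://purl.org/dc/elements/1.1/}"

    # A namespace-aware parser picks the RDF statements out of the mixed markup.
    root = ET.fromstring(doc)
    for desc in root.iter(RDF + "Description"):
        print(desc.get(RDF + "about"), "->", desc.findtext(DC + "title"))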

> Alistair > The other question is, should the request for the thesaurus URI
> > also return the entire content of the thesaurus? Personally I think no,
> > but again I'm not sure about that.
> 
> It never should by default!! A well established thesaurus easily counts
> 100,000s of concepts and more! The requester (be it human or machine) must
> be able to identify the thesaurus source without downloading the whole thing.

Yep, I agree, downloading the entire database will be relatively rare.

> In my personal vision, the "whole thing" *never* will be downloaded at once:
> avoid redundancy, and what the hell are we doing here? ---
> We are establishing means to *link to specific* concepts and make clear
> where they come from.

Search engine apps may well find value in downloading the entire thing.

As host to http://xmlns.com/wordnet/1.6/ I have noticed that some people
have tried to crawl the entire dataset with repeated HTTP requests,
presumably so they can populate a local database for query etc. I'd like
to have conventions for giving them the entire dataset in a more
efficient manner.
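
One possible convention (just a sketch; the dump URL below is made up) would
be to publish a single compressed dump alongside the per-concept URIs, so a
crawler needs one request instead of tens of thousands:

    import gzip
    import urllib.request

    # Hypothetical bulk dump published alongside the per-concept URIs.
    DUMP = "http://example.org/wordnet/1.6/all.rdf.gz"

    with urllib.request.urlopen(DUMP) as resp:
        data = gzip.decompress(resp.read())
    print(len(data), "bytes of RDF fetched in one request")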

Dan

> Downloading and so duplicating a thesaurus is OK in some situations, but
> this should be regarded as a very special use case.
> 
> Thomas
> 
> 

Received on Saturday, 1 May 2004 11:02:12 UTC