ISSUE-60: Guidelines needed for proper construction of vocabulary scheme and 'term' URIs

http://www.w3.org/2006/07/SWD/track/issues/60

Raised by: Jon Phipps
On product: Recipes

The Recipes say little about best practices for constructing URIs for
vocabulary schemes and 'terms'.

This issue is based on a request from the International Press
Telecommunications Council (IPTC):
http://lists.w3.org/Archives/Public/public-swd-wg/2007Sep/0023.html

The IPTC is seeking guidance on constructing URIs to identify taxonomy schemes
and "terms". They have two questions:

1.  Should we opt for "#" or "#_" or "/" or "?" or "?<foo>=" or 
    some other string as the scheme URI terminator?

2.  What mechanism should we adopt for constructing code URIs?
(By "code URIs" they mean URIs that map to existing term codes: "...we
decided to continue using the large number of existing codes that are used in
News and related industries today.")

Below is the full text of the email that raised the questions:

Introduction
------------

Below is a draft statement of matters on which the International 
Press Telecommunications Council [1] seeks help from the W3C and 
from the broader Semantic Web community.

The statement hasn't yet been reviewed by the relevant IPTC 
groups, but to save time I'm sending it in draft form.  We would 
very much like to have these matters resolved by the time of the 
next IPTC meeting on 15-17 October 2007, in Prague.


Background
----------

The IPTC decided a few years ago that its new G2 family of News 
Exchange standards must be compatible with the Semantic Web.  We 
decided that:

1.  Terms from taxonomies used for News would be associated with 
    individual URIs.

2.  We would encourage the use of GRDDL to convert News marked 
    up with metadata into forms understood by SemWeb tools.

At the same time we decided to continue using the large number 
of existing codes that are used in News and related industries 
today.

To reconcile these two requirements (SemWeb plus existing codes), 
we chose an approach somewhat similar to QNAMEs, though with 
several significant differences. The approach is:

-  Codes exist within (coding) schemes.  Familiar examples are:
      ISO 4217 alpha codes
      ISO 4217 numeric codes
      ISO 3166-1 two-letter alpha codes
      ISO 3166-1 three-letter alpha codes
      ISO 3166-1 numeric codes
      IETF BCP 47 language tags

    Possibly less familiar examples are:
      CUSIPs (eg "037833100", Apple Computer)
      ISBNs (eg "0-321-18578-1", The Unicode Standard, Version 4.0)
      ISSNs (eg "0261-3077", The Guardian)
      SEDOLs (eg "0263494", BAE Systems)
      Valorens (eg "1203203", UBS)

-  Each coding scheme is associated with a URI.  That URI *must* 
   resolve to a resource (or resources) containing information 
   about the scheme.

-  Each scheme URI is locally mapped to a prefix.

-  There are almost no constraints on the values of codes.  For 
   example, a code may start with a digit.

-  A qualified code (QCODE) is expressed in the form:
      prefix:code

-  We shall define rules for how scheme URIs should be terminated.
   These rules may take the form of guidelines.

-  We shall define rules for the construction of a code URI from 
   the corresponding scheme URI and the code.  These rules may or 
   may not specify simple concatenation.

-  In the case of schemes controlled by the News industry, each 
   code URI *must* resolve to a resource or (content negotiated) 
   resources containing information about the code.

-  In the case of schemes used but not controlled by the News 
   industry, each code URI *should* resolve to a resource or 
   (content negotiated) resources containing information about 
   the code.
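
The QCODE approach in the list above can be sketched in a few lines of 
Python.  This is only an illustration under stated assumptions: it uses 
simple concatenation and scheme URIs terminated by "/", both of which the 
message leaves open.  The prefix map, the scheme URIs (including the 
http:// scheme) and the name expand_qcode are invented for this sketch, 
and the percent-encoding step is an assumption rather than an IPTC rule.

```python
from urllib.parse import quote

# Hypothetical prefix-to-scheme-URI map; every URI here is made up.
SCHEME_URIS = {
    "subj": "http://www.iptc.org/taxonomies/subjects/",
    "ccy": "http://example.org/schemes/iso4217-alpha/",
    "isbn": "http://example.org/schemes/isbn/",
}


def expand_qcode(qcode, schemes=SCHEME_URIS):
    """Expand a qualified code 'prefix:code' into a code URI.

    Assumes simple concatenation of scheme URI and code.
    """
    # Split on the first colon only: codes are almost unconstrained,
    # so the code part may itself contain a colon.
    prefix, sep, code = qcode.partition(":")
    if not (sep and prefix and code):
        raise ValueError(f"not a qualified code: {qcode!r}")
    if prefix not in schemes:
        raise ValueError(f"unknown scheme prefix: {prefix!r}")
    # Assumption: percent-encode anything outside the URI unreserved set
    # so that unusual codes still yield a syntactically valid URI.
    return schemes[prefix] + quote(code, safe="")


if __name__ == "__main__":
    for qcode in ("subj:12345678", "ccy:USD", "isbn:0-321-18578-1"):
        print(qcode, "->", expand_qcode(qcode))
```

Because the prefix is resolved through a local map, a code that starts 
with a digit is not a problem at the QCODE level; the difficulty only 
appears once the code is placed after a "#" in the resulting URI.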


Matters we need help with
-------------------------

1.  Should we opt for "#" or "#_" or "/" or "?" or "?<foo>=" or 
    some other string as the scheme URI terminator?

2.  What mechanism should we adopt for constructing code URIs?

    Simple concatenation would work for (made up) URIs such as:
       www.iptc.org/taxonomies/subjects#_
       www.iptc.org/taxonomies/subjects/
       www.iptc.org/taxonomies/subjects?
       www.iptc.org/taxonomies/subjects?code=

    It would not work for:
       www.iptc.org/taxonomies/subjects#
    as the resulting URI would not be legal for HTML if the code 
    started with a digit.

    The alternative is to inject some buffer string during the 
    construction of the code URI.  This would probably have to be 
    a fixed string for all News taxonomies, as the alternative of 
    retrieving (from the scheme URI?) per-scheme rules seems too 
    burdensome for the recipient.

    Such a string could be, eg "_", so allowing a scheme URI such 
    as:
       www.iptc.org/taxonomies/subjects#
    and a code URI such as:
       www.iptc.org/taxonomies/subjects#_12345678

    Alternatively, such a string could be, eg "#_", so allowing a 
    scheme URI such as:
       www.iptc.org/taxonomies/subjects
    and a code URI such as:
       www.iptc.org/taxonomies/subjects#_12345678

    The disadvantage of both approaches is that such a rule would 
    make it difficult for people to use scheme URIs such as:
       www.iptc.org/taxonomies/subjects/
       www.iptc.org/taxonomies/subjects?
       www.iptc.org/taxonomies/subjects?code=

3.  We would very much appreciate help in developing a GRDDL 
    script for our G2 standards.  Nearly two years ago we 
    developed a script to convert NewsML-G2 to RDF triples 
    (N-Triples).  We were not, however, able to figure out how to 
    handle statements about statements.  Note that for each piece 
    of descriptive metadata we support attributes such as:
       creator
       date modified
       confidence
       relevance
       why present

    Thus one can, loosely speaking, express:

       On 7 September 2007, Reuters stated that this News item 
       has a subject of:
       -  George W. Bush (with 60% confidence)
       -  George H. W. Bush (with 40% confidence)

    We appreciate that the best way to handle statements about 
    statements may still be unresolved within the SemWeb 
    community.

4.  We request that the W3C and the broader Semantic Web 
    community take our requirements into consideration in the 
    development of new specifications and tools, and in the 
    enhancement of existing ones.  We are aware that some of 
    these assume particular URI formats, eg the presence of a "#" 
    as a separator or the absence of a digit after such a "#".
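
The terminator and buffer-string options from items 1 and 2 can be 
compared side by side with a short Python sketch.  It is an illustration 
only: the http:// prefix is added to the made-up URIs from the message, 
the function names are invented, and the check covers just the one 
condition the message names (a fragment that begins with a digit), not 
the full HTML or XML name grammar.

```python
import re

BASE = "http://www.iptc.org/taxonomies/subjects"

# (scheme URI, buffer string) pairs taken from the options in item 2.
CANDIDATES = [
    (BASE + "#_", ""),
    (BASE + "/", ""),
    (BASE + "?", ""),
    (BASE + "?code=", ""),
    (BASE + "#", ""),    # plain "#": the case the message says fails
    (BASE + "#", "_"),   # fixed buffer string "_"
    (BASE, "#_"),        # fixed buffer string "#_"
]


def build_code_uri(scheme_uri, code, buffer=""):
    """Concatenate scheme URI, optional buffer string, and code."""
    return scheme_uri + buffer + code


def fragment_starts_with_digit(uri):
    """True if the URI has a fragment whose first character is 0-9."""
    _, sep, fragment = uri.partition("#")
    return bool(sep) and re.match(r"[0-9]", fragment) is not None


def report(code="12345678"):
    """Return (code URI, has-digit-led-fragment) for every candidate."""
    rows = []
    for scheme_uri, buffer in CANDIDATES:
        uri = build_code_uri(scheme_uri, code, buffer)
        rows.append((uri, fragment_starts_with_digit(uri)))
    return rows


if __name__ == "__main__":
    for uri, problem in report():
        print("PROBLEM" if problem else "ok     ", uri)
```

Only the plain "#" terminator with no buffer string is flagged.  The 
drawback described in item 2 also shows up: a fixed buffer string of "_" 
applied to a scheme URI ending in "/" gives 
build_code_uri(BASE + "/", "12345678", "_"), that is 
http://www.iptc.org/taxonomies/subjects/_12345678, where the underscore 
serves no purpose.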

[1] http://www.iptc.org/

Thank you

Misha Wolf
News Standards Manager
Reuters

Received on Tuesday, 18 September 2007 15:18:16 UTC