Re: IPTC asks W3C for help in the integration of News into the SemWeb from Ed Summers on 2007-09-11 (public-swd-wg@w3.org from September 2007)

From: Ed Summers <edsu@loc.gov>
Date: Tue, 11 Sep 2007 10:16:39 -0400
To: <public-swd-wg@w3.org>
Message-Id: <46E66B0702000015003B59C9@ntgwgate.loc.gov>

Some of us at the Library of Congress are also investigating making code lists such as ISO 639-2 [1] available in a SemWeb-friendly manner. I think a lot of the questions Misha brought up in relation to the notion of a QCODE are also of great interest to us.

//Ed

[1] http://www.loc.gov/standards/iso639-2/

>>> Misha Wolf <Misha.Wolf@reuters.com> 09/07/07 4:11 PM >>>

Introduction
------------

Below is a draft statement of matters on which the International
Press Telecommunications Council [1] seeks help from the W3C and
from the broader Semantic Web community.

The statement hasn't yet been reviewed by the relevant IPTC
groups, but to save time I'm sending it in draft form. We would
very much like to have these matters resolved by the time of the
next IPTC meeting on 15-17 October 2007, in Prague.

Background
----------

The IPTC decided a few years ago that its new G2 family of News
Exchange standards must be compatible with the Semantic Web. We
decided that:

1. Terms from taxonomies used for News would be associated with
individual URIs.

2. We would encourage the use of GRDDL to convert News marked
up with metadata into forms understood by SemWeb tools.

At the same time we decided to continue using the large number
of existing codes that are used in News and related industries
today.

To reconcile these two requirements (SemWeb plus existing codes),
we chose an approach somewhat similar to QNAMEs, though with
several significant differences. The approach is:

- Codes exist within (coding) schemes. Familiar examples are:
ISO 4217 alpha codes
ISO 4217 numeric codes
ISO 3166-1 two-letter alpha codes
ISO 3166-1 three-letter alpha codes
ISO 3166-1 numeric codes
IETF BCP 47 language tags

Possibly less familiar examples are:
CUSIPs (eg "037833100", Apple Computer)
ISBNs (eg "0-321-18578-1", The Unicode Standard, Version 4.0)
ISSNs (eg "0261-3077", The Guardian)
SEDOLs (eg "0263494", BAE Systems)
Valorens (eg "1203203", UBS)

- Each coding scheme is associated with a URI. That URI *must*
resolve to a resource (or resources) containing information
about the scheme.

- Each scheme URI is locally mapped to a prefix.

- There are almost no constraints on the values of codes. For
example, a code may start with a digit.

- A qualified code (QCODE) is expressed in the form:
prefix:code

- We shall define rules for how scheme URIs should be terminated.
These rules may take the form of guidelines.

- We shall define rules for the construction of a code URI from
the corresponding scheme URI and the code. These rules may or
may not specify simple concatenation.

- In the case of schemes controlled by the News industry, each
code URI *must* resolve to a resource or (content negotiated)
resources containing information about the code.

- In the case of schemes used but not controlled by the News
industry, each code URI *should* resolve to a resource or
(content negotiated) resources containing information about
the code.
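
As a rough illustration of the approach listed above, the following
Python sketch expands a QCODE into a code URI by looking up the prefix
and concatenating the code onto the scheme URI. The prefix table, both
scheme URIs, and the function name expand_qcode are made up for this
example, and simple concatenation is only one of the construction rules
under consideration.

```python
# Hypothetical prefix table: each prefix is locally mapped to a scheme URI.
# Both URIs are made up for illustration only.
PREFIXES = {
    "subj": "http://www.iptc.org/taxonomies/subjects/",
    "cur": "http://example.org/schemes/iso4217-alpha/",
}


def expand_qcode(qcode, prefixes=PREFIXES):
    """Expand 'prefix:code' into a code URI by simple concatenation."""
    # Split on the first colon only: codes are almost unconstrained,
    # so a code may itself contain a colon or start with a digit.
    prefix, sep, code = qcode.partition(":")
    if not sep or not prefix or not code:
        raise ValueError(f"not a QCODE: {qcode!r}")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return prefixes[prefix] + code


print(expand_qcode("subj:12345678"))
print(expand_qcode("cur:USD"))
```

Because the prefix mapping is local, two documents may use different
prefixes for the same scheme and still arrive at the same code URI.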

Matters we need help with
-------------------------

1. Should we opt for "#" or "#_" or "/" or "?" or "?<foo>=" or
some other string as the scheme URI terminator?

2. What mechanism should we adopt for constructing code URIs?

Simple concatenation would work for (made up) URIs such as:
www.iptc.org/taxonomies/subjects#_
www.iptc.org/taxonomies/subjects/
www.iptc.org/taxonomies/subjects?
www.iptc.org/taxonomies/subjects?code=

It would not work for:
www.iptc.org/taxonomies/subjects#
as the resulting URI would not be legal for HTML if the code
started with a digit.

The alternative is to inject some buffer string during the
construction of the code URI. This would probably have to be
a fixed string for all News taxonomies, as the alternative of
retrieving (from the scheme URI?) per-scheme rules seems too
burdensome for the recipient.

Such a string could be, eg "_", so allowing a scheme URI such
as:
www.iptc.org/taxonomies/subjects#
and a code URI such as:
www.iptc.org/taxonomies/subjects#_12345678

Alternatively, such a string could be, eg "#_", so allowing a
scheme URI such as:
www.iptc.org/taxonomies/subjects
and a code URI such as:
www.iptc.org/taxonomies/subjects#_12345678

The disadvantage of both approaches is that such a rule would
make it difficult for people to use scheme URIs such as:
www.iptc.org/taxonomies/subjects/
www.iptc.org/taxonomies/subjects?
www.iptc.org/taxonomies/subjects?code=

3. We would very much appreciate help in developing a GRDDL
script for our G2 standards. Nearly two years ago we
developed a script to convert NewsML-G2 to RDF triples
(N-Triples). We were not, however, able to figure out how to
handle statements about statements. Note that for each piece
of descriptive metadata we support attributes such as:
creator
date modified
confidence
relevance
why present

Thus one can, loosely speaking, express:

On 7 September 2007, Reuters stated that this News item
has a subject of:
- George W. Bush (with 60% confidence)
- George H. W. Bush (with 40% confidence)

We appreciate that the best way to handle statements about
statements may still be unresolved within the SemWeb
community.

4. We request that the W3C and the broader Semantic Web
community take our requirements into consideration in the
development of new specifications and tools, and in the
enhancement of existing ones. We are aware that some of
these assume particular URI formats, eg the presence of a "#"
as a separator or the absence of a digit after such a "#".
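
To make the trade-offs in matter 2 concrete, here is a small Python
sketch that builds code URIs with and without a fixed buffer string and
checks whether the resulting fragment identifier starts legally. The
"http://" scheme added to the made-up URIs, the function names, and the
name rule are all assumptions of this sketch: the check is a simplified,
ASCII-only approximation of an XML name (a letter or underscore first),
and the exact constraint intended for HTML may be stricter.

```python
import re
from urllib.parse import urldefrag

# Simplified, ASCII-only approximation of an XML name: a letter or
# underscore first, then letters, digits, '.', '-' or '_'.
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*\Z")


def build_code_uri(scheme_uri, code, buffer=""):
    """Concatenate scheme URI, optional fixed buffer string, and code."""
    return scheme_uri + buffer + code


def fragment_is_name(uri):
    """True if the URI has no fragment, or its fragment passes NAME_RE."""
    fragment = urldefrag(uri).fragment
    return fragment == "" or NAME_RE.match(fragment) is not None


base = "http://www.iptc.org/taxonomies/subjects"
code = "12345678"

candidates = [
    build_code_uri(base + "#", code),        # "#" terminator, no buffer
    build_code_uri(base + "#", code, "_"),   # "#" terminator, buffer "_"
    build_code_uri(base, code, "#_"),        # no terminator, buffer "#_"
    build_code_uri(base + "/", code),        # "/" terminator, no buffer
    build_code_uri(base + "?code=", code),   # query terminator, no buffer
]
for uri in candidates:
    print(uri, fragment_is_name(uri))
```

Only the first candidate fails the check, since its fragment begins
with a digit. The second and third produce the same code URI, which is
why the choice between them matters only for what the scheme URI itself
looks like.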
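
For matter 3, one candidate for statements about statements is the RDF
reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate,
rdf:object). The Python sketch below emits N-Triples for the Bush
example that way. It is not the IPTC script and not a recommendation:
the example.org URIs, the property names (subject, creator,
dateModified, confidence), and the function name reify are all
hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/g2#"  # hypothetical vocabulary, not an IPTC namespace


def reify(node, subject, predicate, obj, annotations):
    """Return N-Triples lines describing one statement and its annotations.

    node is a blank node label standing for the statement itself;
    annotations maps predicate URIs to already-serialised N-Triples objects.
    """
    lines = [
        f"_:{node} <{RDF}type> <{RDF}Statement> .",
        f"_:{node} <{RDF}subject> <{subject}> .",
        f"_:{node} <{RDF}predicate> <{predicate}> .",
        f"_:{node} <{RDF}object> <{obj}> .",
    ]
    for pred, value in annotations.items():
        lines.append(f"_:{node} <{pred}> {value} .")
    return lines


item = "http://example.org/news/item1"  # hypothetical News item URI
common = {
    EX + "creator": "<http://example.org/org/reuters>",
    EX + "dateModified": f'"2007-09-07"^^<{XSD}date>',
}
triples = []
for i, (person, confidence) in enumerate(
    [("george-w-bush", "0.6"), ("george-h-w-bush", "0.4")], start=1
):
    annotations = dict(common)
    annotations[EX + "confidence"] = f'"{confidence}"^^<{XSD}decimal>'
    triples += reify(f"s{i}", item, EX + "subject",
                     "http://example.org/people/" + person, annotations)

print("\n".join(triples))
```

Note that reifying triples describe a statement without asserting it,
so a consumer would still need a rule for when the underlying subject
triple is to be taken as asserted.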

[1] http://www.iptc.org/

Thank you

Misha Wolf
News Standards Manager
Reuters

Received on Tuesday, 11 September 2007 14:17:15 UTC