RE: vision for controlled vocabulary use and management from Bernard Vatant on 2004-11-05 (public-esw-thes@w3.org from November 2004)

From: Bernard Vatant <bernard.vatant@mondeca.com>
Date: Fri, 5 Nov 2004 15:18:10 +0100
To: "Ron Davies" <ron@rondavies.be>, <public-esw-thes@w3.org>
Message-ID: <GOEIKOOAMJONEFCANOKCAEPNFBAA.bernard.vatant@mondeca.com>

Ron

Addressing more or less what you write below:

"If you have look at any multilingual thesaurus, you soon run into situations where the
underlying conceptual structures appear to be different in two different languages because
the words that they use are not coherent. Is one conceptual structure right and the other
wrong? How do we tell? Most often, you have to adopt one or other of the conceptual
structures, and then try desperately to make the terms from the other language fit and
hope that your poor users are not utterly confused. The thesaurus standards are full of
examples of rather unsatisfactory ways you can try to get around this problem."

I made a suggestion a week ago on the SWBPD Vocabulary Management Task Force list:
http://lists.w3.org/Archives/Public/public-swbp-wg/2004Oct/0185.html

In that post I try to work out the various ways to tackle such concept identification
issues in multilingual environments, a question sadly overlooked in most Semantic Web
venues ... as a matter of fact, I have had no answer so far over there :((

Bernard

**********************************************************************************

Bernard Vatant
Senior Consultant
Knowledge Engineering
bernard.vatant@mondeca.com

"Making Sense of Content" : http://www.mondeca.com
"Everything is a Subject" : http://universimmedia.blogspot.com

**********************************************************************************
-----Original Message-----
From: public-esw-thes-request@w3.org [mailto:public-esw-thes-request@w3.org] On Behalf Of
Ron Davies
Sent: Monday, 1 November 2004 20:26
To: public-esw-thes@w3.org
Subject: Re: vision for controlled vocabulary use and management

Alistair,

I had started to prepare a response to your post a few weeks ago, but you had indirectly
raised so many issues that I was discouraged from trying to deal with them all.
Dagobert's response has given me new courage to try to address some of these issues.

(1) Concept-Oriented Design and Construction

I'm not really sure what you mean by concept-oriented design and construction, except that
you want applications to use concept identifiers rather than terms from natural languages
to identify a concept.

If you mean something more-- perhaps some notion of Platonic concepts that exist
independently of language?-- then there are certainly some thorny philosophical issues
here. I am much less sanguine than Dagobert is about the ease with which we can separate
concepts from language, at least in the "soft" social sciences. If you have a look at any
multilingual thesaurus, you soon run into situations where the underlying conceptual
structures appear to be different in two different languages because the words that they
use are not coherent. Is one conceptual structure right and the other wrong? How do we
tell? Most often, you have to adopt one or other of the conceptual structures, and then
try desperately to make the terms from the other language fit and hope that your poor
users are not utterly confused. The thesaurus standards are full of examples of rather
unsatisfactory ways you can try to get around this problem.

If you don't mean this, but you simply want applications to use concept identifiers, I'm
not sure this is a major issue. Whether in a particular system a concept identifier has
been entered into an indexing record or an alphabetic string representing a preferred term
label really doesn't matter very much. It isn't more "concept-oriented" to rely on an
identifier code-- a code is simply an identifier in another (indexing) language.
Identifier codes have certainly been used in indexing applications, particularly in
environments where synonym rings are required. (I can't remember the name of the system,
but at an architectural information centre in Washington twelve or fifteen years ago I saw
such a system, developed I think with the participation of the Getty Art History
Information Program. All authority control was handled through relational links).

Whether to use identifier codes in an indexing/retrieval application or not depends on
very practical considerations in the design of the indexing/retrieval system (e.g. what
kind of data you want to expose to others, and where further processing of the data is
done). For example, you can expose data with a code (an artificial language), and expect
the client to do a lookup to substitute a preferred term label in some natural language,
or you can provide the preferred term label itself and expect the client to do the
translation into other natural languages, or you can expose data with all of the preferred
terms in the various natural languages. It's true that using a code makes a few internal
changes easier to implement (spelling changes, or swaps between preferred and
non-preferred terms) but these are only a small percentage of the changes that take place.
And those changes are easy to implement even in a system that uses natural language terms
to identify the concept as long as the system supports a global change operation.
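
The three ways of exposing data described above, plus the global change operation, can be sketched roughly as follows. This is only an illustration: the concept code "C0042", the label table and all function names are hypothetical and do not come from any actual system.

```python
# Hypothetical thesaurus fragment: one concept code with its preferred
# term label in each natural language. All names here are assumptions.
PREFERRED_LABELS = {
    "C0042": {"en": "labour", "fr": "travail"},
}


def expose_code(concept_id):
    """Option 1: expose only the code; the client must do a lookup."""
    return {"concept": concept_id}


def expose_label(concept_id, lang):
    """Option 2: expose one preferred label; the client translates it."""
    return {"label": PREFERRED_LABELS[concept_id][lang], "lang": lang}


def expose_all_labels(concept_id):
    """Option 3: expose the preferred labels in every language."""
    return {"labels": dict(PREFERRED_LABELS[concept_id])}


def client_lookup(record, lang):
    """Client side of option 1: substitute a preferred label for the code."""
    return PREFERRED_LABELS[record["concept"]][lang]


def global_change(index_records, old_term, new_term):
    """Global change in an index that stores natural language terms:
    replace old_term with new_term in every indexing record."""
    return [
        [new_term if term == old_term else term for term in record]
        for record in index_records
    ]
```

With a code, a spelling change touches only the label table. With terms stored in the indexing records, the same change is one call to global_change over all the records.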

(3) Concept-Oriented Maintenance & Management
<snip>

This means that, if an authority wishes to
significantly refactor/reorganise/redefine some of its concepts, this is
best done by defining and publishing some new concepts and new concept
identifiers.

Again, as Dagobert points out, this is far from simple. The devil here is in the details.
What is a significant change? How do we wish to update the indexing data?

For example, we might have
- swapping a non-preferred term ("data sticks") with a preferred term ("USB keys"). The
concept is the same; it is simply that the label has altered. In other words, the meaning of
the concept hasn't changed.
- swapping a spelling or dialectal variant ("labor") with a preferred term ("labour").
Again the meaning hasn't changed.
- swapping an abbreviation ("AIDS") with a full form ("Acquired immune deficiency syndrome").
Ditto.

The above all seem to be semantically neutral, i.e. the concept hasn't changed. However
consider:

- adding a scope note ("Use this only for X. For other cases, use the new term Y".) The
meaning has changed, but the expression of the concept, i.e. the preferred term, hasn't.
- adding a history note ("Used up until 2004. After 2004, use W or Z").
- two concepts are merged into a single concept
- a concept is split into two concepts
- a non-preferred term is removed
- a non-preferred term is removed because it's become its own term, i.e. there really is a
new _concept_. "Tropical products" loses the UF "Bananas", because "Bananas" is now a new
concept.
- adding a new BT or a new NT or a new RT or deleting one or more of these.

Which of these represents a new concept? Or do they all? How are we to update the
indexing data, e.g. in the case of a split?

Replacement relationships may be then defined between the old
concepts and the new, which would support perfect interoperability between
systems employing old and new concept sets, and would also support automated
updating of indexing metadata.

Traditionally, thesauri have used versioning to control differences in conceptual
structure. In other words, a thesaurus is published in one version, changes are made
"offline" and then a new version is produced and published. This has the advantage of
organizing work processes, allowing for users to get familiar with a particular conceptual
structure, and permitting developers to check each version for conceptual coherence. Links
between concepts in one version and another version (the electronic version of the
traditional Additions and Changes Lists) could be indicated by mapping from one to another
just as we map from one thesaurus to another. The mapping could then be applied to update
the indexing applications, where the update can be done automatically, i.e. where it is
simple. (It can't in all cases, as anyone who has done this kind of work can confirm.)
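
As a rough sketch of what applying such a version-to-version mapping might look like: the mapping table, the record layout and the function name below are all assumptions made for illustration, and the direction of each swap is chosen arbitrarily. A term with exactly one target is replaced automatically. A term with several targets (a split) is left untouched and reported for a human to look at.

```python
# Hypothetical mapping from terms in version 1 to terms in version 2.
# One target: a simple replacement. Several targets: a split, which
# cannot be resolved without looking at the indexed document.
VERSION_MAPPING = {
    "data sticks": ["USB keys"],
    "labor": ["labour"],
    "Tropical products": ["Tropical products", "Bananas"],  # split
}


def update_index(records, mapping):
    """Apply the mapping to (doc_id, terms) records.

    Returns the updated records and a list of (doc_id, old_term, targets)
    entries that need manual review.
    """
    updated, review = [], []
    for doc_id, terms in records:
        new_terms = []
        for term in terms:
            targets = mapping.get(term, [term])  # unmapped terms stay as-is
            if len(targets) == 1:
                new_terms.append(targets[0])  # the simple, automatic case
            else:
                new_terms.append(term)  # keep the old term until reviewed
                review.append((doc_id, term, targets))
        updated.append((doc_id, new_terms))
    return updated, review
```

For example, a record indexed with "labor" and "data sticks" is updated without intervention, while a record indexed with "Tropical products" ends up in the review list, because only a person can say whether the document is really about bananas.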

One of the reasons I mention this is that trying to carry in a concept record _all_ the
past history via replacement relationships seems to me to complicate enormously the
structure as well as leading to semantic difficulties. And a complicated structure, which
is difficult for people to understand (even if they aren't often asked to do so) will turn
people off using SKOS (which is meant to be Simple). Whereas publishing this information
as a mapping (for which there is already a structure defined) is much simpler for mere
humans to understand.

Anyway, I hope these few thoughts help. These are difficult issues to try to tackle in an
online discussion.

Ron

-----------------------------------------------
Ron Davies
Information and documentation systems consultant
Av. Baden-Powell 1 Bte 2, 1200 Brussels, Belgium Email: ron@rondavies.be
Tel: +32 (0)2 770 33 51
GSM: +32 (0)484 502 393

Received on Friday, 5 November 2004 14:18:23 UTC