Re: vision for controlled vocabulary use and management from Ron Davies on 2004-11-01 (public-esw-thes@w3.org from November 2004)

From: Ron Davies <ron@rondavies.be>
Date: Mon, 01 Nov 2004 20:25:47 +0100
To: public-esw-thes@w3.org
Message-Id: <6.0.0.22.2.20041008135839.01ca61e0@pop.skynet.be>
Alistair,

I had started to prepare a response to your post a few weeks ago, but you 
had raised indirectly so many issues that I got discouraged with trying to 
deal with them all. Dagobert's response has given me new courage to try to 
address some of these issues.

>(1) Concept-Oriented Design and Construction

I'm not really sure what you mean by concept-oriented design and 
construction, except that you want applications to use concept identifiers 
rather than terms from natural languages to identify a concept.

If you mean something more--  perhaps some notion of Platonic concepts that 
exist independently of language?--  then there are certainly some thorny 
philosophical issues here. I am much less sanguine than Dagobert is about 
the ease with which we can separate concepts from language, at least in the 
"soft" social sciences. If you have a look at any multilingual thesaurus, you 
soon run into situations where the underlying conceptual structures appear 
to be different in two different languages because the words that they use 
are not coherent. Is one conceptual structure right and the other wrong? 
How do we tell? Most often, you have to adopt one or other of the 
conceptual structures, and then try desperately to make the terms from the 
other language fit and hope that your poor users are not utterly confused. 
The thesaurus standards are full of examples of rather unsatisfactory ways 
you can try to get around this problem.

If you don't mean this, but you simply want applications to use concept 
identifiers, I'm not sure this is a major issue. Whether in a particular 
system a concept identifier has been entered into an indexing record or an 
alphabetic string representing a preferred term label really doesn't matter 
very much. It isn't more "concept-oriented" to rely on an identifier 
code--  a code is simply an identifier in another (indexing) language. 
Identifier codes have certainly been used in indexing applications, 
particularly in environments where synonym rings are required. (I can't 
remember the name of the system, but at an architectural information centre 
in Washington twelve or fifteen years ago I saw such a system, developed I 
think with the participation of the Getty Art History Information Program. 
All authority control was handled through relational links).

Whether to use identifier codes in an indexing/retrieval application or not 
depends on very practical considerations in the design of the 
indexing/retrieval system (e.g. what kind of data you want to expose to 
others, and where further processing of the data is done). For example, you 
can expose data with a code (an artificial language), and expect the client 
to do a lookup to substitute a preferred term label in some natural 
language, or you can provide the preferred term label itself and expect the 
client to do the translation into other natural languages, or you can 
expose data with all of the preferred terms in the various natural 
languages. It's true that using a code makes a few internal changes easier 
to implement (spelling changes, or swaps between preferred and 
non-preferred terms) but these are only a small percentage of the changes 
that take place.  And those changes are easy to implement even in a system 
that uses natural language terms to identify the concept as long as the 
system supports a global change operation.
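
To make the trade-off concrete, here is a minimal Python sketch of the two approaches. Everything in it is an assumption made for illustration: the concept codes ("C001", "C002"), the label table, and the function names are invented and do not come from any real system.

```python
# Hypothetical label table: concept codes mapped to preferred labels per language.
# Codes and labels are invented for illustration only.
LABELS = {
    "C001": {"en": "labour", "fr": "travail"},
    "C002": {"en": "USB keys", "fr": "clefs USB"},
}


def display_labels(record_codes, lang):
    """Client-side lookup: substitute the preferred label for each code."""
    return [LABELS[code][lang] for code in record_codes]


def global_change(records, old_term, new_term):
    """Global change for a system that stores preferred terms directly.

    Returns new records with every occurrence of old_term replaced.
    """
    return [[new_term if term == old_term else term for term in record]
            for record in records]


# Indexing records that store codes: a label swap touches only LABELS.
code_records = [["C001", "C002"], ["C002"]]

# Indexing records that store terms: a label swap needs a global change.
term_records = [["labour", "data sticks"], ["data sticks"]]
updated_terms = global_change(term_records, "data sticks", "USB keys")
```

In the first case a label swap means editing one entry in the lookup table; in the second it means one pass over the indexing records. Either way the work is small, which is the point being made above.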

>(3) Concept-Oriented Maintenance & Management
<snip>
>This means that, if an authority wishes to
>significantly refactor/reorganise/redefine some of its concepts, this is
>best done by defining and publishing some new concepts and new concept
>identifiers.

Again, as Dagobert points out, this is far from simple. The devil here is 
in the details. What is a significant change? How do we wish to update the 
indexing data?

For example, we might have
- swapping a non-preferred term ("data sticks") with a preferred term ("USB 
keys"). The concept is the same; it is simply that the label has altered. In 
other words, the meaning of the concept hasn't changed.
- swapping a spelling or dialectal variant ("labor") with a preferred term 
("labour"). Again the meaning hasn't changed.
- swapping an abbreviation ("AIDS") with a full form ("Acquired immune 
deficiency syndrome"). Ditto.

The above all seem to be semantically neutral, i.e. the concept hasn't 
changed. However consider:

- adding a scope note ("Use this only for X. For other cases, use the new 
term Y".) The meaning has changed, but the expression of the concept, i.e. 
the preferred term, hasn't.
- adding a history note ("Used up until 2004. After 2004, use W or Z").
- two concepts are merged into a single concept
- a concept is split into two concepts
- a non-preferred term is removed
- a non-preferred term is removed because it's become its own term, i.e. 
there really is a new _concept_. "Tropical products" loses the UF 
"Bananas", because "Bananas" is now a new concept.
- adding a new BT or a new NT or a new RT or deleting one or more of these.

Which of these represents a new concept? Or do they all? How are we to 
update the indexing data, e.g. in the case of a split?

>Replacement relationships may be then defined between the old
>concepts and the new, which would support perfect interoperability between
>systems employing old and new concept sets, and would also support automated
>updating of indexing metadata.

Traditionally thesauri have usually used versioning to control differences 
in conceptual structure. In other words, a thesaurus is published in one 
version, changes are made "offline" and then a new version is produced and 
published. This has the advantage of organizing work processes, allowing 
for users to get familiar with a particular conceptual structure, and 
permitting developers to check each version for conceptual coherence. Links 
between concepts in one version and another version (the electronic version 
of the traditional Additions and Changes Lists) could be indicated by 
mapping from one to another just as we map from one thesaurus to another. 
The mapping could then be applied to update the indexing applications, 
where the update can be done automatically, i.e. where it is simple. (It 
can't in all cases, as anyone who has done this kind of work can confirm.)
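
As a rough illustration of applying such a mapping, here is a minimal Python sketch. The mapping table, the record layout, and the function name are all assumptions made for the example; the terms are borrowed from the examples earlier in this message. One-to-one mappings are applied automatically, and anything with more than one target is left untouched and flagged for a human.

```python
# Hypothetical mapping from terms in the old version to terms in the new one.
# One target: simple replacement. Several targets: a split a human must resolve.
MAPPING = {
    "labor": ["labour"],
    "Tropical products": ["Tropical products", "Bananas"],
}


def apply_mapping(records, mapping):
    """Apply a version-to-version mapping to indexing records.

    Returns (updated_records, needs_review), where needs_review lists
    (record_index, term) pairs that could not be updated automatically.
    """
    updated = []
    needs_review = []
    for index, record in enumerate(records):
        new_record = []
        for term in record:
            targets = mapping.get(term, [term])  # unmapped terms stay as they are
            if len(targets) == 1:
                new_record.append(targets[0])
            else:
                new_record.append(term)  # keep the old term until reviewed
                needs_review.append((index, term))
        updated.append(new_record)
    return updated, needs_review


records = [["labor"], ["Tropical products", "labor"]]
updated, needs_review = apply_mapping(records, MAPPING)
```

The review list is where the real effort goes: deciding, record by record, which side of a split applies is not something the mapping itself can tell you.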

One of the reasons I mention this is that trying to carry in a concept 
record _all_ the past history via replacement relationships seems to me to 
complicate the structure enormously, as well as leading to semantic 
difficulties. And a complicated structure, which is difficult for people to 
understand (even if they aren't often asked to do so) will turn people off 
using SKOS (which is meant to be Simple). Whereas publishing this 
information as a mapping (for which there is already a structure defined) 
is much simpler for mere humans to understand.

Anyway, I hope these few thoughts help. These are difficult issues to try 
to tackle in an online discussion.

Ron
>-----------------------------------------------
>Ron Davies
>Information and documentation systems consultant
>Av. Baden-Powell 1  Bte 2, 1200 Brussels, 
>Belgium       Email:  ron@rondavies.be
>Tel:    +32 (0)2 770 33 51
>GSM:    +32 (0)484 502 393
Received on Monday, 1 November 2004 19:26:31 UTC