

From: Thomas Bandholtz <thomas.bandholtz@innoq.com>
Date: Thu, 22 Oct 2009 22:40:05 +0200
Message-ID: <4AE0C325.6000802@innoq.com>
To: Johan De Smedt <Johan.De-smedt@tenforce.com>
CC: Stella Dextre Clarke <stella@lukehouse.org>, Antoine Isaac <aisaac@few.vu.nl>, SKOS <public-esw-thes@w3.org>
Hi Johan,
> Suggestion: There are three levels of organization.
> - Concepts (SKOS talk)
> - Labels
> - Text processing
Good idea!
I would add: Labels are skosxl, text processing is not yet really
covered by skos(xl), but can be supported by extending skosxl locally.
> A significant part of the issues discussed related to what is on the label management level
> and what is on the text processing level (thus needing a proper definition)
> Language specific text processing and analysis (including inflection)
> seems to me a specialized area which global resources (language dictionaries)
> like WordNet can solve.
http://wordnet.princeton.edu/wordnet/ starts with this sentence:
"WordNet® is a large lexical database of English". Right. We have more
than 20 languages in European GEMET. Believe me, when it comes to
language-specific text processing, English is the simplest language.

> Stemming is also in this area.
> It seems to me costly if this would be managed in every thesaurus.
It is costly, sure, but as I have said before, UMTHES has already
invested in this. The question now is how to express the results in
a skosxl extension, not whether UMTHES should discard all the results of
this investment. You are right on one point: in general, a thesaurus
need not care about this. It is not a general requirement. But
language-specific text processing needs to be solved on a
language-specific level by someone, somehow.
> Label management can focus on standard terms and term decomposition as relevant within a 
> thesaurus or taxonomy.  (equivalence relation, compound equivalence, acronym, 
> short-name, qualifiers ...)
Right so far. What we try to handle is this: each such term (= label) has
multiple spelling conventions, and a spelling variant does not make a
different term on the same level. Maybe this is specific to some
languages only and not such an issue in English.
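
To make the point concrete, here is a minimal sketch of one label carrying
several spellings. The class name, its fields, and the German example are
illustrative assumptions, not the actual UMTHES schema or data.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Label:
    """One term (label): a preferred spelling plus its spelling variants.

    Hypothetical structure for illustration only.
    """
    literal_form: str
    lang: str
    spelling_variants: Set[str] = field(default_factory=set)

    def matches(self, text: str) -> bool:
        """True if text is the preferred form or a known spelling variant."""
        candidates = {self.literal_form, *self.spelling_variants}
        return text.casefold() in {c.casefold() for c in candidates}


# Assumed example: reformed and pre-reform German spelling of the same term.
label = Label("Schifffahrt", "de", {"Schiffahrt"})
```

Both spellings resolve to the same Label object, so the variant never becomes
a second term next to the first one.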
> Indexing and search engines combining thesaurus and text processing can use the label
> management layer (of the thesaurus) to configure the relevant text processing.
I think this needs a third, dedicated layer.
> Concept and label processing surely belong to the thesaurus/taxonomy/... management.
> Text processing, I would suggest, is in the text processing engines.
Right, but text processing engines need some structure to express the
diversity of term (label) occurrence in natural language.
> PS:
> - thanks for the UMTHES presentation - very instructive.
Thanks for the flowers, I tried hard to provide some valuable
contribution. As always, one has to surrender at some point of
complexity (just to be on time for the meeting) and leave the rest to
the next presentation, ...
> - would it be an idea to build on further SKOS extensions to have common schema for
>   artefacts like equivalence relation and compound equivalence; or for specializing
>   some xl:labelRelation ?
I think we should collect more examples and patterns, and we should not
try to harmonise this too strictly.
What we tried to implement in UMTHES: separate a pure SKOS Core
representation, which everybody can handle, from an admittedly more
experimental extension which goes beyond established skos(xl)
patterns. But UMTHES needs it now (!) as an exchange format
in a real production scenario, so we cannot wait.
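
A rough sketch of that separation, assuming the data is available as plain
(subject, predicate, object) triples: anything whose predicate lies in the
SKOS Core namespace goes into the part everybody can handle, the rest into
the extension part. The function name, the example.org namespace, and the
spellingVariant property are hypothetical. Whether skosxl predicates count as
"standard" is a choice left to the prefixes argument.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
EXT = "http://example.org/thesaurus-ext#"  # hypothetical local extension


def split_core_and_extension(triples, standard_prefixes=(SKOS,)):
    """Split triples by predicate namespace into (core, extension) lists."""
    core, extension = [], []
    for triple in triples:
        predicate = triple[1]
        if predicate.startswith(tuple(standard_prefixes)):
            core.append(triple)
        else:
            extension.append(triple)
    return core, extension


# Assumed sample data, not taken from UMTHES.
triples = [
    ("ex:c1", SKOS + "prefLabel", "Schifffahrt"),
    ("ex:c1", SKOS + "broader", "ex:c0"),
    ("ex:c1", EXT + "spellingVariant", "Schiffahrt"),
]
core, extension = split_core_and_extension(triples)
```

A consumer that only understands SKOS Core can load the first list and ignore
the second without losing the concept structure.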

Thanks, Johan, for your comments; they are really helpful for thinking this over further.

Thomas Bandholtz, thomas.bandholtz@innoq.com, http://www.innoq.com 
innoQ Deutschland GmbH, Halskestr. 17, D-40880 Ratingen, Germany
Phone: +49 228 9288490 Mobile: +49 178 4049387 Fax: +49 228 9288491
Received on Thursday, 22 October 2009 20:40:41 UTC
