Re: MHDBDB: Connecting Senses with multiple Concepts from Christian Chiarcos on 2018-11-06 (public-ontolex@w3.org from November 2018)

From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Date: Tue, 06 Nov 2018 19:16:16 +0100
To: "John P. McCrae" <john.mccrae@insight-centre.org>
Cc: "public-ontolex@w3.org" <public-ontolex@w3.org>, "Hinkelmanns Peter" <peter.hinkelmanns@sbg.ac.at>
Message-ID: <op.zr2k1evv89jat0@kitaba.rz.uni-frankfurt.de>
On .11.2018, 16:36, John P. McCrae
<john.mccrae@insight-centre.org> wrote:

> Yes, to summarize other authors, it is not expected that a sense should
> have multiple references, unless they are semantically equivalent (e.g.,
> skos:exactMatch). This seems relatively straightforward if you think
> about... if you make a distinction in your SKOS thesaurus, why wouldn't
> the same distinction be necessary in the lexicon?

It is straightforward, indeed, but only if one of the following holds:
(1) you model thesaurus and lexicon from scratch and as a single resource  
(which seems to be the case here),
(2) you start with an existing ontology and want to build a dictionary for  
it, or
(3) you start with a dictionary and want to build an ontology for it.

In one case, it is not:
(4) you want to combine an existing dictionary and an existing ontology  
with each other.

There is another reason: It is possible that a thesaurus provides  
high-level distinctions only. Think of the Dornseiff groups for German
"Wasser" (water,  
http://corpora.uni-leipzig.de/de/res?corpusId=deu_newscrawl_2011&word=Wasser):

7.8 transparent
7.61 liquid
13.21 inorganic chemistry
16.8 drinks, non-alcoholic

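And this is how they occur in the Wikipedia definition:
"Wasser (H2O) ist eine chemische Verbindung (=> 13.21) aus den Elementen
Sauerstoff (O) und Wasserstoff (H). Wasser ist als Flüssigkeit (=> 7.61)
durchsichtig (=> 7.8)"
("Water (H2O) is a chemical compound of the elements oxygen (O) and
hydrogen (H). As a liquid, water is transparent").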

Obviously, we can create a concept "transparent liquid; inorganic;
suitable for drinking", but this is not provided by Dornseiff & Quasthoff
(2004), and if we (being the creators of neither the dictionary we
start with nor the thesaurus) create it, it needs to have a different
ontological status than the rest of the thesaurus -- because it differs in
provenance.

The underlying problem of this particular thesaurus is that it focuses
on feature decomposition rather than on providing a concept inventory.
This is not unusual for thesauri. My feeling in the Dornseiff case is
actually that we should not use ontolex:reference, but ontolex:denotes.
The latter is somewhat vague in its definition, but that also corresponds
to the rather abstract nature of the thesaurus concepts. If the MHDBDB
categories are rather abstract (I remember some are), this would be an
alternative there, too. No cardinality restrictions apply to
ontolex:denotes.
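
To make the cardinality point concrete, here is a minimal sketch in plain
Python (no RDF library). It assumes that ontolex:reference is a functional
property, as declared in the OntoLex-Lemon model, and that ontolex:denotes
is not. The identifiers ":wasser", ":wasser_sense1" and the "dornseiff:"
prefix are made up for illustration; they are not taken from any published
dataset. The check is a closed-world approximation: an OWL reasoner would
not report an error for two references on one sense, but would rather infer
that the two referents are identical, which is exactly what we do not want
to claim for two Dornseiff groups.

```python
from collections import defaultdict

# Triples as (subject, predicate, object) strings.
# ":wasser", ":wasser_sense1" and "dornseiff:..." are hypothetical identifiers.
TRIPLES = [
    (":wasser", "ontolex:denotes", "dornseiff:7.8"),     # transparent
    (":wasser", "ontolex:denotes", "dornseiff:7.61"),    # liquid
    (":wasser", "ontolex:denotes", "dornseiff:13.21"),   # inorganic chemistry
    (":wasser", "ontolex:denotes", "dornseiff:16.8"),    # drinks, non-alcoholic
    (":wasser", "ontolex:sense", ":wasser_sense1"),
    (":wasser_sense1", "ontolex:reference", "dornseiff:7.61"),
]

# Properties assumed to be functional (at most one value per subject).
FUNCTIONAL = {"ontolex:reference"}


def cardinality_violations(triples):
    """Map (subject, predicate) to its sorted values for every functional
    property that has more than one distinct value on the same subject."""
    values = defaultdict(set)
    for subj, pred, obj in triples:
        if pred in FUNCTIONAL:
            values[(subj, pred)].add(obj)
    return {key: sorted(objs) for key, objs in values.items() if len(objs) > 1}


# Four ontolex:denotes links on the entry are fine; a second
# ontolex:reference on the same sense is flagged.
WITH_SECOND_REFERENCE = TRIPLES + [
    (":wasser_sense1", "ontolex:reference", "dornseiff:7.8"),
]
```

Calling cardinality_violations(TRIPLES) yields an empty result, whereas
cardinality_violations(WITH_SECOND_REFERENCE) reports the sense with its
two references.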

> Similarly, different parts-of-speech necessarily have different meanings,

We have counterexamples to this, often among function words: Many English  
prepositions are also complementizers (subordinating conjunctions), verbal  
particles, and sometimes adverbs -- these do not necessarily differ in  
meaning, but only in syntax (i.e., the element they modify). Pronouns and  
determiners are another typical case, hence some tagsets just lump them  
together -- lexinfo doesn't.

Such phenomena occur in open categories as well, but there they are usually
treated as "zero morphology". German adjectives can be systematically used
as adverbs (again, these differ only by the element they modify).

A third case is in lexicography of historical language varieties, where  
parts of speech may have changed during the time the dictionary covers,  
and we might want to formulate different interpretations. Old High German  
prepositions could be used as adverbs, and many prepositional adverbs were  
grammaticalized into verbal particles over time, but there is a long  
transition period where a preposition can equally be regarded (by a modern
lexicographer) as a verbal particle or an adverb -- in the same syntactic
context, two parts of speech apply (particle or adverb), or rather, they are
indistinguishable, and they could be recorded as such in a historical
dictionary of German.

Data sparsity is yet another source of multiple parts of speech: For
example, a word may be attested, but only in ambiguous contexts.
Hypothetical parts of speech are typical for low-resource languages (see
https://archive.org/details/lalanguegauloise00dottuoft/page/262, "ieuru
... datif singulier ou forme verbale" [dative singular or verb form], and
later "iorebe ... verbe ou datif pluriel" [verb or dative plural]). We
don't want to create one LexicalEntry per *hypothetical* part of speech,
do we?
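
As a rough illustration of the alternative, the following Python sketch
keeps a single entry and records all candidate parts of speech on it. The
class name, the field names and the ":ieuru" identifier are hypothetical,
and mapping "dative singular" to lexinfo:noun and "verb form" to
lexinfo:verb is my own simplification of the source's wording, not
something the cited dictionary states in these terms.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One entry per attested form, however many analyses are on the table."""
    written_rep: str
    # Candidate parts of speech; more than one means the analysis is undecided.
    pos_candidates: set = field(default_factory=set)

    def is_pos_ambiguous(self):
        return len(self.pos_candidates) > 1


def to_triples(entry_id, entry):
    """Emit one lexinfo:partOfSpeech triple per candidate, in sorted order."""
    return [
        (entry_id, "lexinfo:partOfSpeech", pos)
        for pos in sorted(entry.pos_candidates)
    ]


# Hypothetical encoding of "ieuru": noun (dative singular) or verb form.
ieuru = LexicalEntry("ieuru", {"lexinfo:noun", "lexinfo:verb"})
```

Here ieuru.is_pos_ambiguous() is true, and to_triples(":ieuru", ieuru)
produces two lexinfo:partOfSpeech statements on the same subject instead of
two separate entries.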

Best,
Christian
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931
Received on Tuesday, 6 November 2018 18:16:45 UTC