multilingual lexical entries? from Christian Chiarcos on 2021-12-07 (public-ontolex@w3.org from December 2021)

From: Christian Chiarcos <christian.chiarcos@gmail.com>
Date: Tue, 7 Dec 2021 16:32:48 +0100
To: public-ontolex <public-ontolex@w3.org>
Message-ID: <CAC1YGdibR-ufs=o3sUsQvMLFx+JE7W0YKvCxrPdRMga5-Hf+tg@mail.gmail.com>
Dear all,

In several use cases, I have come across the need to provide one lexical
entry for multiple languages.

In one group of cases (esp. etymological dictionaries), this can be
worked around by using lexicog:Entry instead and then pointing to
language-specific lexical entries. (This is technically possible, but
very inelegant, unnecessarily verbose, and clearly a departure from, or
obfuscation of, the original structure of the lexical resource.)

However, in another case (dictionaries/glossaries for cuneiform languages),
we have the problem that we cannot always tell what language a text (and
thus, a word) is in. This is because of the multilingual situation of
Sumerian and Akkadian during the 3rd millennium BC, because of the use of
ideographic signs, because scribes often did not write out morphemes but
only the stem of a word, and because of the habit of
Akkadian and Hittite scribes to just write Sumerian (or Akkadian) words
instead of words in their native tongue, because these were more established in the
writing tradition. Although there are phonological or morphological
complements that can reveal the language, these are not systematically
used, so that we have uncertainties about the language of individual words
or even entire texts. However, if these texts form the basis for a glossary
or dictionary, these uncertainties percolate to the glossary, especially if
it is corpus-based. The Electronic Pennsylvania Sumerian Dictionary (ePSD)
thus does not distinguish Sumerian and Akkadian forms: it groups everything
under the same headword and provides Sumerian and Akkadian readings of the
same sign. (The selection of texts is such that a Sumerian reading is more
likely, but it is not guaranteed.) In some cases, this dictionary even
notes doubts about whether a word is Sumerian in the
first place (http://oracc.museum.upenn.edu/epsd2/cbd/sux/o0023151.html).

Such data does not allow us to create distinct lexical entries for the two
(or, in the case of Hittite texts, three) languages that would simply be
grouped under the same lexicog:Entry, because we cannot decide which
information (other than the possible Sumerian and Akkadian interpretations
of the same cuneiform writtenRep) belongs to which lexical entry.

For this reason, we are currently considering language-agnostic
lexical entries for a future CDLI glossary (https://cdli.ucla.edu/), where
language information is provided only on the form (or even only within the
writtenRep), but not on the lexical entry. Note that there is no constraint
in the OntoLex core model that requires a single language per lexical entry.
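
To make the idea of a language-agnostic entry concrete, here is a minimal
sketch in Python that prints Turtle for one such entry. Everything specific in
it is an assumption for illustration only: the http://example.org/glossary/
namespace, the local names, and the choice of the well-known pair Sumerian
"lugal" / Akkadian "šarrum" are not taken from CDLI or ePSD data. The point is
only that the language tag sits on each writtenRep, while the lexical entry
itself carries no language statement.

```python
from dataclasses import dataclass, field

# Real OntoLex namespace; the default ":" namespace is a made-up placeholder.
PREFIXES = """@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix : <http://example.org/glossary/> .
"""


@dataclass
class Form:
    local_name: str   # hypothetical local name of the form resource
    written_rep: str  # the reading, used as the writtenRep literal
    lang: str         # language tag attached to the writtenRep literal


@dataclass
class LexicalEntry:
    local_name: str   # hypothetical local name of the entry resource
    forms: list = field(default_factory=list)


def to_turtle(entry):
    """Serialize an entry with at least one form; no language on the entry."""
    form_refs = ", ".join(f":{f.local_name}" for f in entry.forms)
    lines = [
        PREFIXES,
        f":{entry.local_name} a ontolex:LexicalEntry ;",
        f"    ontolex:lexicalForm {form_refs} .",
    ]
    for f in entry.forms:
        lines.append(f":{f.local_name} a ontolex:Form ;")
        lines.append(f'    ontolex:writtenRep "{f.written_rep}"@{f.lang} .')
    return "\n".join(lines)


# Illustrative data only: one entry, two readings in two languages.
entry = LexicalEntry(
    "lugal_entry",
    [
        Form("lugal_form_sux", "lugal", "sux"),
        Form("lugal_form_akk", "šarrum", "akk"),
    ],
)
turtle = to_turtle(entry)
print(turtle)
```

The sketch uses ontolex:lexicalForm for both forms so that neither reading is
privileged as the canonical one; whether that is the right choice for a real
glossary is a separate modelling decision.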

What OntoLex says about language is not in the core model, but in Lime:
"note that all entries in the same lexicon should be in the same language
and that the language of the lexicon and entry should be consistent with
the language tags used on all forms". This is a comment (in parentheses, in
the accompanying text, and, if assumed to be relevant for the definition of
ontolex:LexicalEntry, in the wrong place), formulated as a recommendation
and not part of any definition.
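
If one did want to treat the quoted sentence as a checkable rule, it could be
sketched roughly as follows. This is only my reading of the recommendation,
not anything defined by OntoLex or Lime: the function name, the input layout
(entry id mapped to an optional entry language and a list of form language
tags), and the decision to compare only the primary language subtag are all
assumptions.

```python
def primary_subtag(tag):
    """Return the lower-cased primary language subtag, e.g. 'sux' for 'sux-Xsux'."""
    return tag.split("-")[0].lower()


def lime_language_issues(lexicon_lang, entries):
    """List (entry_id, message) pairs where the recommendation is not met.

    entries maps an entry id to (entry_lang or None, [form language tags]).
    An entry without a language is not reported by itself; only its form
    tags are compared with the lexicon language.
    """
    expected = primary_subtag(lexicon_lang)
    issues = []
    for entry_id, (entry_lang, form_tags) in entries.items():
        if entry_lang is not None and primary_subtag(entry_lang) != expected:
            issues.append(
                (entry_id, f"entry language '{entry_lang}' differs from lexicon language '{lexicon_lang}'")
            )
        for tag in form_tags:
            if primary_subtag(tag) != expected:
                issues.append(
                    (entry_id, f"form tag '{tag}' differs from lexicon language '{lexicon_lang}'")
                )
    return issues


# Illustrative data only: a language-agnostic entry with forms in two languages,
# placed in a lexicon declared as Sumerian.
entries = {"lugal_entry": (None, ["sux", "akk"])}
issues = lime_language_issues("sux", entries)
for entry_id, message in issues:
    print(entry_id, message)
```

Under this reading, a language-agnostic entry with both Sumerian and Akkadian
forms is reported, which is exactly the situation described above.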

If we nevertheless consider this statement to be binding, the CDLI solution
would be to create a dictionary with senses and lexicog:Entry instances, but
without any ontolex:LexicalEntry instances. I would prefer not to do that.
(I would still prefer to avoid multilingual lexical entries in cases in which
language-specific information is provided, and thus to keep the
recommendation in place as is, but this is not the case here.)

Best,
Christian
Received on Wednesday, 8 December 2021 08:00:38 UTC