Re: multilingual lexical entries? from Christian Chiarcos on 2022-01-05 (public-ontolex@w3.org from January 2022)

From: Christian Chiarcos <christian.chiarcos@gmail.com>
Date: Wed, 5 Jan 2022 16:34:00 +0100
To: Jorge Gracia del Río <jogracia@unizar.es>
Cc: public-ontolex <public-ontolex@w3.org>
Message-ID: <CAC1YGdgTn5bafRLBNY-qb_TVZ0bUe692y3_fL8eiLtJz7phsEg@mail.gmail.com>
Dear Jorge,

thanks for the suggestion. Of course that would work from a modelling
perspective, but the problem is that in many cases we just don't know what
the language is, and it could be either Sumerian or Akkadian and even have
different readings (i.e., Latin renderings) for the same signs. For a
frequent word like a unit of weight (as in the example), this clearly
applies to both languages, but in other cases we risk creating ghost
entries instead of providing the language only in cases where we are
certain about the language.

For this particular case (etymological dictionaries are different), the
problem is not so much that the entries are multilingual, but that the
defining criterion for what enters our dictionary is not the language, but the
writing system, time and provenance of the writing. At times, we don't know
the language, and for languages with ideographic writing systems, this can
occur regularly. There are, indeed, entire writing systems that are not
language-specific and for whose texts we cannot really tell what the
language was (e.g., https://en.wikipedia.org/wiki/Zapotec_script, whose
tendency towards abandoning syllabic characters seems to be motivated by
its spread to foreign speaker communities; the linguistic identification of
the entire Teotihuacano writing is very uncertain, cf.
https://www.mesoweb.com/bearc/caa/AA01.pdf, and also early Sumerian writing
is fully pictorial, so we cannot ascertain its actual language and only
speculate that it was Sumerian, e.g., for the
https://en.wikipedia.org/wiki/Kish_tablet, -- and this has been debated).

The practical problem is that we need to duplicate large parts of our
dictionary, and in particular, this pertains to the attestations (all
occurrences in the corpus should be linked). For a sample window of 100
years (2100-2000 BCE), we are talking about a corpus of about 3 million
tokens where the problem of multilinguality is particularly prevalent, and
if no automated disambiguation can be performed, we might end up linking
each token twice. With the current FrAC vocabulary, that would mean that we
create some 15 million additional triples (5 triples per attestation, for
3 million tokens) simply for the luxury of having two lexical entries. We
could link the attestations to the lexical concept, but in fact, we need to
link them with a particular form, not with a particular meaning. (So we
need resolvable ontolex:Forms.) I am not sure whether the same form should
occur with different lexical entries (this seems counter-intuitive, but is
not formally required, depending on the generic or specific reading of the
determiner in "one grammatical realization of *a* lexical entry."), but
these need to be duplicated then, too. In fact, using one ontolex:Form with
multiple lexical entries (i.e., the same entry for different languages)
could be another solution to this problem.
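
As a minimal sketch of that last option (not an established pattern; the
namespace and resource names are hypothetical, and whether a form may be
shared across entries is exactly the open question above), a single
resolvable form could carry all readings, with attestations pointing at it:

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix :        <http://example.org/glossary#> .   # hypothetical namespace

# One resolvable form; attestations would link here, once per token.
:sze_form a ontolex:Form ;
    ontolex:writtenRep "𒊺"@sux-Xsux , "𒊺"@akk-Xsux ,
                       "sze"@sux-Latn , "uţţatu"@akk-Latn .

# Two language-specific entries that share the same form.
:sze_sux a ontolex:LexicalEntry ;
    ontolex:canonicalForm :sze_form .

:sze_akk a ontolex:LexicalEntry ;
    ontolex:canonicalForm :sze_form .
```

With this layout the lexical entries are still duplicated, but the form and
everything attached to it (including the attestations) exist only once.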

We will have the same problem for pictograms at some point. We certainly do
for things like road signs and emoticons, which differ in form and function
over certain areas (think of the use of stop signs in EU vs. US all-way
stops), but these areas do not overlap with particular languages -- and it
is still possible (and there seems to be a need) to create machine-readable
dictionaries for them:
https://github.com/nikukyugamer/kaomojitoka-to-google-ime-dictionary.

For this reason, because of its obscure way of introduction (i.e., not at
LexicalEntry but in Lime), and because it is actually not part of any
definition, but just mentioned in accompanying text, I am wondering whether
OntoLex is actually supposed to have a single language constraint. I think
it is clear that there must be a preference to have that (which is why lime
says "should", not "must"), and that that should be formulated more
explicitly in the core module. But also, I have a feeling that in the
context of diachrony, multilingual terminology and multimedia the existence
of cross-linguistic lexical entries will be a recurring question, so if any
deviations from or refinements of OntoLex core properties (e.g., in
designated subclasses) become necessary, it would be good to refer to that
line in the documentation.

I suggest that we *decide* on one of the following additions to OntoLex core:
(a) stricter definition: "A lexical entry can define its language using the
properties lime:language or dct:language (see Metadata module). It is
recommended to create different lexical entries for different languages."
(b) broader definition: "A lexical entry can define one or multiple
languages using the properties lime:language or dct:language (see Metadata
module)."
[insert right after "A Lexical Entry thus needs to be associated with at
least one form, and has at most one canonical form (see below)."]

This is a clarification for the following passages from Lime:
"note that all entries in the same lexicon should be in the same language"
(which does not say what happens if the same lexical entry occurs in
multiple lexicons -- actually, this doesn't seem to be ruled out by Lime).
"The language property indicates the language of a lexicon, a lexical
entry, a concept set or a lexicalization set." (whether this says anything
about cardinality constraints depends on the generic or exhaustive
interpretation of the determiner, so this is ambiguous)
"Beyond using the lime:language property, which has a Literal as a range,
it is recommended to use the Dublin Core language property"
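
To make the difference between (a) and (b) concrete, here is a rough sketch
of both readings for the example entry. The namespace, the resource names
and the choice of Lexvo URIs for dct:language are assumptions made for
illustration only; they are not taken from the OntoLex or Lime documentation.

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lime:    <http://www.w3.org/ns/lemon/lime#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix :        <http://example.org/glossary#> .   # hypothetical namespace

# Reading (b): one lexical entry that declares two languages.
:sze_le a ontolex:LexicalEntry ;
    lime:language "sux" , "akk" ;
    # Language URIs are an assumption; any agreed URI scheme would do.
    dct:language <http://lexvo.org/id/iso639-3/sux> ,
                 <http://lexvo.org/id/iso639-3/akk> ;
    ontolex:canonicalForm [
        ontolex:writtenRep "𒊺"@sux-Xsux , "𒊺"@akk-Xsux
    ] .

# Reading (a): the same data split into one entry per language.
:sze_sux a ontolex:LexicalEntry ;
    lime:language "sux" ;
    ontolex:canonicalForm [ ontolex:writtenRep "𒊺"@sux-Xsux ] .

:sze_akk a ontolex:LexicalEntry ;
    lime:language "akk" ;
    ontolex:canonicalForm [ ontolex:writtenRep "𒊺"@akk-Xsux ] .
```

Data written in the style of (a) remains valid under (b), whereas the first
entry above would only be acceptable under (b).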

Independently of what we eventually decide, it makes sense to put a
note on the language property into OntoLex core because the property occurs
in the diagram and examples, but not in the text.

From the feedback I got so far I expect a general preference for (a), so
this seems to be the default assumption. Personally, I am more in favor of
the broader definition (b) because it does not invalidate any resources
created in accordance with (a), because it is consistent with our earlier
use for multilingual and etymological databases (which (a) is not), because
we arrive at a more compact modelling, and because it minimizes the
dependency on non-core modules (such dependencies make data less
comprehensible for future users). Maybe others can give some feedback here.

Best,
Christian

On Wed, 5 Jan 2022 at 10:38, Jorge Gracia del Río <
jogracia@unizar.es> wrote:

> Dear Christian,
>
> What about this other approach? That is, creating a
> "language-agnostic" lexicog:Entry per known record in the dictionary, and
> then instantiating lexical entries to account for the language-specific
> information:
>
> :sze_concept a ontolex:LexicalConcept;
>      skos:definition "unit of weight, approx 0.04 g" .
>
> :sze_sux a ontolex:LexicalEntry;
>     ontolex:canonicalForm [
>         ontolex:writtenRep "𒊺"@sux-Xsux;
>         ontolex:writtenRep "sze"@sux-Latn
>     ] .
>
> :sze_akk a ontolex:LexicalEntry;
>     ontolex:canonicalForm [
>        ontolex:writtenRep "𒊺"@akk-Xsux;
>        ontolex:writtenRep "uţţatu"@akk-Latn
>     ] .
>
> :sze_concept ontolex:isEvokedBy :sze_sux, :sze_akk .
>
> :sze_entry a lexicog:Entry ;
>      lexicog:describes :sze_sux, :sze_akk .
>
>
> Best regards,
>
> Jorge
>
> On Wed, 8 Dec 2021 at 14:54, Christian Chiarcos (<
> christian.chiarcos@gmail.com>) wrote:
>
>> Dear all,
>>
>> just for clarification, the following is what I would like to do:
>>
>> :sze_le a ontolex:LexicalEntry;
>>     ontolex:canonicalForm [
>>         # or: ontolex:writtenRep "𒊺"@sux-Xsux, "𒊺"@akk-Xsux
>>         ontolex:writtenRep "𒊺";
>>         ontolex:writtenRep "sze"; # transliteration
>>         ontolex:writtenRep "sze"@sux-Latn; # transcription
>>         ontolex:writtenRep "uţţatu"@akk-Latn # transcription
>>     ];
>>     ontolex:sense [ rdfs:comment "unit of weight, approx 0.04 g" ].
>>
>> The alternative with lexicog:Entry (and without duplicating
>> LexicalEntries) would be
>>
>> :sze_le a lexicog:Entry;
>>     lexicog:describes [ a ontolex:Form;
>>         ontolex:writtenRep "𒊺";
>>         ontolex:writtenRep "sze"; # transliteration
>>         ontolex:writtenRep "sze"@sux; # transcription
>>         # IMHO different language tags should be unproblematic for forms
>>         ontolex:writtenRep "uţţatu"@akk # transcription
>>     ];
>>     lexicog:describes [ a ontolex:LexicalSense;
>>         rdfs:comment "unit of weight, approx 0.04 g" ].
>>
>> The latter way of modelling should be in line with the documentation, but
>> it makes large parts of OntoLex-Lemon redundant and others (e.g.,
>> canonicalForm) inapplicable. I would prefer to avoid that.
>>
>> Best,
>> Christian
>>
>> On Tue, 7 Dec 2021 at 16:32, Christian Chiarcos <
>> christian.chiarcos@gmail.com> wrote:
>>
>>> Dear all,
>>>
>>> for different use cases, I came across the need to provide one lexical
>>> entry for multiple languages.
>>>
>>> In one group of cases (esp., etymological dictionaries), this can be
>>> circumvented by using lexicog:Entry instead and then pointing to
>>> language-specific lexical entries. (This is very inelegant,
>>> unnecessarily verbose and clearly a departure from/obfuscation of the
>>> original structure of the lexical resource, but technically, it is a
>>> possibility.)
>>>
>>> However, in another case (dictionaries/glossaries for cuneiform
>>> languages), we have the problem that we cannot always tell what language a
>>> text (and thus, a word) is in. This is because of the multilingual
>>> situation of Sumerian and Akkadian during the 3rd millennium BC, because
>>> of the use of ideographic signs, because scribes often did not write out
>>> morphemes, but just the stem of a word, and because Akkadian and Hittite
>>> scribes habitually wrote Sumerian (or Akkadian) words instead of words
>>> in their native tongue, as these were more established in the
>>> writing tradition. Although there are phonological or morphological
>>> complements that can reveal the language, these are not systematically
>>> used, so that we have uncertainties about the language of individual words
>>> or even entire texts. However, if these texts form the basis for a glossary
>>> or dictionary, these uncertainties percolate to the glossary, especially if
>>> it is corpus-based. The Electronic Penn Sumerian dictionary thus does not
>>> distinguish Sumerian and Akkadian forms and just groups everything under
>>> the same head word and just provides Sumerian and Akkadian readings of the
>>> same sign. (The selection of texts is such that a Sumerian reading is more
>>> likely, but it is not always necessary.) In some cases in this dictionary,
>>> it is even marked that there are doubts that a word is Sumerian in the
>>> first place (http://oracc.museum.upenn.edu/epsd2/cbd/sux/o0023151.html).
>>>
>>> Such data does not allow us to create distinct lexical entries for both
>>> (or, in case of Hittite texts, three) languages that would just go under
>>> the same lexicog:Entry, because we cannot decide which information (other
>>> than the possible Sumerian and Akkadian interpretations of the same
>>> Cuneiform writtenRep) belongs to which lexical entry.
>>>
>>> For this reason, we are currently considering language-agnostic
>>> lexical entries for a future CDLI glossary (https://cdli.ucla.edu/),
>>> where language information is provided only at the form (or even, within
>>> the writtenRep), but not at the lexical entry. Note that there is no
>>> constraint in the OntoLex core model that requires a single language per
>>> lexical entry.
>>>
>>> What OntoLex says about language is not in the core model, but in Lime:
>>> "note that all entries in the same lexicon should be in the same language
>>> and that the language of the lexicon and entry should be consistent with
>>> the language tags used on all forms". This is a comment (in parentheses, in
>>> accompanying text, and if assumed to be relevant for the definition of
>>> ontolex:LexicalEntry, in the wrong place), formulated as a recommendation
>>> and not part of any definition.
>>>
>>> If we consider this statement to be nevertheless binding, the CDLI
>>> solution would be to create a dictionary with senses and lexicog:Entrys,
>>> but without ontolex:LexicalEntrys. I would prefer not to. (I would still prefer to
>>> avoid multilingual lexical entries in cases in which language-specific
>>> information is provided, and thus to keep the recommendation in place, as
>>> is, but this is not the case here.)
>>>
>>> Best,
>>> Christian
>>>
>>
Received on Wednesday, 5 January 2022 15:34:26 UTC