Re: multilingual lexical entries? from Francis Bond on 2022-01-06 (public-ontolex@w3.org from January 2022)

From: Francis Bond <bond@ieee.org>
Date: Thu, 6 Jan 2022 09:38:39 +0800
To: Christian Chiarcos <christian.chiarcos@gmail.com>
Cc: Jorge Gracia del Río <jogracia@unizar.es>, public-ontolex <public-ontolex@w3.org>
Message-ID: <CA+arSXiHM4wZbux0K6hLhGJg1QATJNS=2OKsjCzzLU60+tKofg@mail.gmail.com>

Can you instead model languages hierarchically?

So say we don't know whether a word is Malay or Indonesian: we mark it as
Malay_family, or whatever name we choose.  Then these entries can still have
a language; it is just underspecified ...

Note that the supertype does not have to be related genealogically,
especially in the cases where the same script may be used for multiple
language families.
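
A minimal Turtle sketch of this idea, assuming hypothetical language nodes
in an example namespace and skos:broader as the hierarchy relation (neither
is prescribed by OntoLex or Lime; the node names are made up for
illustration):

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix skos:    <http://www.w3.org/2004/02/skos/core#> .
@prefix :        <http://example.org/lang-sketch#> .

# Hypothetical underspecified supertype for the Malay/Indonesian case.
:Malay_family a dct:LinguisticSystem, skos:Concept ;
    skos:prefLabel "Malay or Indonesian (underspecified)"@en .

:Malay      a dct:LinguisticSystem, skos:Concept ; skos:broader :Malay_family .
:Indonesian a dct:LinguisticSystem, skos:Concept ; skos:broader :Malay_family .

# A supertype that is not genealogical: it only groups languages
# that share a script (assumption: the name is illustrative).
:Cuneiform_language a dct:LinguisticSystem, skos:Concept ;
    skos:prefLabel "language written in cuneiform (underspecified)"@en .

:Sumerian a dct:LinguisticSystem, skos:Concept ; skos:broader :Cuneiform_language .
:Akkadian a dct:LinguisticSystem, skos:Concept ; skos:broader :Cuneiform_language .

# The entry still has exactly one language; it is just the supertype.
:sze_le a ontolex:LexicalEntry ;
    dct:language :Cuneiform_language ;
    ontolex:canonicalForm [ ontolex:writtenRep "𒊺" ] .
```

With something like this, an entry whose language is later established could
simply be re-pointed from the supertype to :Sumerian or :Akkadian.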


On Wed, Jan 5, 2022 at 11:35 PM Christian Chiarcos <
christian.chiarcos@gmail.com> wrote:

> Dear Jorge,
>
> thanks for the suggestion. Of course that would work from a modelling
> perspective, but the problem is that in many cases we just don't know what
> the language is, and it could be either Sumerian or Akkadian and even have
> different readings (i.e., Latin renderings) for the same signs. For a
> frequent word like a unit of weight (as in the example), this clearly
> applies to both languages, but in other cases we risk creating ghost
> entries instead of providing the language only in cases where we are
> certain about the language.
>
> For this particular case (etymological dictionaries are different), the
> problem is not so much that the entries are multilingual, but that the
> defining criterion for what enters our dictionary is not the language, but the
> writing system, time and provenance of the writing. At times, we don't know
> the language, and for languages with ideographic writing systems, this can
> occur regularly. There are, indeed, entire writing systems that are not
> language-specific and for whose texts we cannot really tell what the
> language was (e.g., https://en.wikipedia.org/wiki/Zapotec_script, whose
> tendency towards abandoning syllabic characters seems to be motivated by
> its spread to foreign speaker communities; the linguistic identification of
> the entire Teotihuacano writing is very uncertain, cf.
> https://www.mesoweb.com/bearc/caa/AA01.pdf, and also early Sumerian
> writing is fully pictorial, so we cannot ascertain its actual language and
> only speculate that it was Sumerian, e.g., for the
> https://en.wikipedia.org/wiki/Kish_tablet, -- and this has been debated).
>
> The practical problem is that we need to duplicate large parts of our
> dictionary, and in particular, this pertains to the attestations (all
> occurrences in the corpus should be linked). For a sample window of 100
> years (2100-2000 BCE), we are talking about a corpus of about 3 million
> tokens where the problem of multilinguality is particularly prevalent, and
> if no automated disambiguation can be performed, we might end up linking
> each token twice. With the current FrAC vocabulary, that would mean creating
> create some 7.5 million additional triples (5 triples per attestation, for
> 3 million tokens) simply for the luxury of having two lexical entries. We
> could link the attestations to the lexical concept, but in fact, we need to
> link them with a particular form, not with a particular meaning. (So we
> need resolvable ontolex:Forms.) I am not sure whether the same form should
> occur with different lexical entries (this seems counter-intuitive, but is
> not formally required, depending on the generic or specific reading of the
> determiner in "one grammatical realization of *a* lexical entry."), but
> these need to be duplicated then, too. In fact, using one ontolex:Form with
> multiple lexical entries (i.e., the same entry for different languages)
> could be another solution to this problem.
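>
> A minimal sketch of that last option, assuming the hypothetical identifiers
> :sze_form, :sze_sux and :sze_akk in an example namespace: two
> language-specific lexical entries share a single resolvable ontolex:Form,
> so each attestation would only need to point to the form once. Whether the
> definition of ontolex:Form permits this depends on the reading discussed
> above.
>
> ```turtle
> @prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
> @prefix lime:    <http://www.w3.org/ns/lemon/lime#> .
> @prefix :        <http://example.org/cdli-sketch#> .
>
> # One resolvable form carrying all readings (identifiers are hypothetical).
> :sze_form a ontolex:Form ;
>     ontolex:writtenRep "𒊺"@sux-Xsux, "𒊺"@akk-Xsux,
>                        "sze"@sux-Latn, "uţţatu"@akk-Latn .
>
> # Two language-specific entries that point to the same form.
> :sze_sux a ontolex:LexicalEntry ;
>     lime:language "sux" ;
>     ontolex:lexicalForm :sze_form .
>
> :sze_akk a ontolex:LexicalEntry ;
>     lime:language "akk" ;
>     ontolex:lexicalForm :sze_form .
> ```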
>
> We will have the same problem for pictograms at some point. We certainly
> do for things like road signs and emoticons, which differ in form and
> function over certain areas (think of the use of stop signs in EU vs. US
> all-way stops), but these areas do not overlap with particular languages --
> and it is still possible (and there seems to be a need) to create
> machine-readable dictionaries for them:
> https://github.com/nikukyugamer/kaomojitoka-to-google-ime-dictionary.
>
> For this reason, because of the obscure way in which it is introduced (i.e.,
> not at LexicalEntry but in Lime), and because it is actually not part of any
> definition, but just mentioned in accompanying text, I am wondering whether
> OntoLex is actually supposed to have a single language constraint. I think
> it is clear that there must be a preference to have that (which is why lime
> says "should", not "must"), and that that should be formulated more
> explicitly in the core module. But also, I have a feeling that in the
> context of diachrony, multilingual terminology and multimedia the existence
> of cross-linguistic lexical entries will be a recurring question, so if any
> deviations from or refinements of OntoLex core properties (e.g., in designated
> subclasses) become necessary, it would be good to refer to that line in
> the documentation.
>
> I suggest that we *decide* on one of the following additions to OntoLex core:
> (a) stricter definition: "A lexical entry can define its language using
> the properties lime:language or dct:language (see Metadata module). It is
> recommended to create different lexical entries for different languages."
> (b) broader definition: "A lexical entry can define one or multiple
> languages using the properties lime:language or dct:language (see Metadata
> module)."
> [insert right after "A Lexical Entry thus needs to be associated with at
> least one form, and has at most one canonical form (see below)."]
>
> This is a clarification for the following passages from Lime:
> "note that all entries in the same lexicon should be in the same language"
> (which does not say what happens if the same lexical entry occurs in
> multiple lexicons -- actually, this doesn't seem to be ruled out by Lime).
> "The language property indicates the language of a lexicon, a lexical
> entry, a concept set or a lexicalization set." (whether this says anything
> about cardinality constraints depends on the generic or exhaustive
> interpretation of the determiner, so this is ambiguous)
> "Beyond using the lime:language property, which has a Literal as a range,
> it is recommended to use the Dublin Core language property"
>
> Independently from what we will eventually decide, it makes sense to put a
> note on the language property into OntoLex core because the property occurs
> in diagram and examples, but not in the text.
>
> From the feedback I got so far I expect a general preference for (a), so
> this seems to be the default assumption. Personally, I am more in favor of
> the broader definition (b) because it does not invalidate any resources
> created in accordance with (a), because it is consistent with our earlier
> use for multilingual and etymological databases (which (a) is not), because
> we arrive at a more compact modelling, and because it minimizes the
> dependency on non-core modules (which will make data less comprehensible
> for future users). Maybe others can give some feedback here.
>
> Best,
> Christian
>
> Am Mi., 5. Jan. 2022 um 10:38 Uhr schrieb Jorge Gracia del Río <
> jogracia@unizar.es>:
>
>> Dear Christian,
>>
>> What about this other approach? That is, creating a
>> "language-agnostic" lexicog:Entry per known record in the dictionary, and
>> then instantiating lexical entries to account for the language-specific
>> information:
>>
>> :sze_concept a ontolex:LexicalConcept;
>>      skos:definition "unit of weight, approx 0.04 g" .
>>
>> :sze_sux a ontolex:LexicalEntry;
>>     ontolex:canonicalForm [
>>         ontolex:writtenRep "𒊺"@sux-Xsux;
>>         ontolex:writtenRep "sze"@sux-Latn
>>     ] .
>>
>> :sze_akk a ontolex:LexicalEntry;
>>     ontolex:canonicalForm [
>>        ontolex:writtenRep "𒊺"@akk-Xsux;
>>        ontolex:writtenRep "uţţatu"@akk-Latn
>>     ] .
>>
>> :sze_concept ontolex:isEvokedBy :sze_sux, :sze_akk .
>>
>> :sze_entry a lexicog:Entry ;
>>      lexicog:describes :sze_sux, :sze_akk .
>>
>>
>> Best regards,
>>
>> Jorge
>>
>> El mié, 8 dic 2021 a las 14:54, Christian Chiarcos (<
>> christian.chiarcos@gmail.com>) escribió:
>>
>>> Dear all,
>>>
>>> just for clarification, the following is what I would like to do:
>>>
>>> :sze_le a ontolex:LexicalEntry ;
>>>     ontolex:canonicalForm [
>>>         # or: ontolex:writtenRep "𒊺"@sux-Xsux, "𒊺"@akk-Xsux
>>>         ontolex:writtenRep "𒊺" ;
>>>         ontolex:writtenRep "sze" ;             # transliteration
>>>         ontolex:writtenRep "sze"@sux-Latn ;    # transcription
>>>         ontolex:writtenRep "uţţatu"@akk-Latn   # transcription
>>>     ] ;
>>>     ontolex:sense [ rdfs:comment "unit of weight, approx 0.04 g" ] .
>>>
>>> The alternative with lexicog:Entry (and without duplicating
>>> LexicalEntries) would be
>>>
>>> :sze_le a lexicog:Entry ;
>>>     lexicog:describes [ a ontolex:Form ;
>>>         ontolex:writtenRep "𒊺" ;
>>>         ontolex:writtenRep "sze" ;        # transliteration
>>>         ontolex:writtenRep "sze"@sux ;    # transcription
>>>         ontolex:writtenRep "uţţatu"@akk   # transcription ...
>>>         # IMHO different language tags should be unproblematic for forms
>>>     ] ;
>>>     lexicog:describes [ a ontolex:LexicalSense ;
>>>         rdfs:comment "unit of weight, approx 0.04 g" ] .
>>>
>>> The latter way of modelling should be in line with the documentation,
>>> but it makes large parts of OntoLex-Lemon redundant and others (e.g.,
>>> canonicalForm) inapplicable; I would prefer to avoid that.
>>>
>>> Best,
>>> Christian
>>>
>>> Am Di., 7. Dez. 2021 um 16:32 Uhr schrieb Christian Chiarcos <
>>> christian.chiarcos@gmail.com>:
>>>
>>>> Dear all,
>>>>
>>>> for different use cases, I came across the need to provide one lexical
>>>> entry for multiple languages.
>>>>
>>>> In one group of cases (esp. etymological dictionaries), this can be
>>>> circumvented by using lexicog:Entry instead and then pointing to
>>>> language-specific lexical entries. (This is very inelegant,
>>>> unnecessarily verbose and clearly a departure from/obfuscation of the
>>>> original structure of the lexical resource, but technically it is a
>>>> possibility.)
>>>>
>>>> However, in another case (dictionaries/glossaries for cuneiform
>>>> languages), we have the problem that we cannot always tell what language a
>>>> text (and thus, a word) is in. This is because of the multilingual
>>>> situation of Sumerian and Akkadian during the 3rd millennium BC, because
>>>> of the use of ideographic signs, because of the laziness of scribes, who
>>>> often did not write morphemes but just the stem of a word, and because of
>>>> the habit of Akkadian and Hittite scribes to just write Sumerian (or
>>>> Akkadian) words instead of their native tongue because these were more
>>>> established in the writing tradition. Although there are phonological or
>>>> morphological
>>>> complements that can reveal the language, these are not systematically
>>>> used, so that we have uncertainties about the language of individual words
>>>> or even entire texts. However, if these texts form the basis for a glossary
>>>> or dictionary, these uncertainties percolate to the glossary, especially if
>>>> it is corpus-based. The Electronic Penn Sumerian dictionary thus does not
>>>> distinguish Sumerian and Akkadian forms and just groups everything under
>>>> the same head word and just provides Sumerian and Akkadian readings of the
>>>> same sign. (The selection of texts is such that a Sumerian reading is more
>>>> likely, but it is not always necessary.) In some cases in this dictionary,
>>>> it is even marked that there are doubts that a word is Sumerian in the
>>>> first place (http://oracc.museum.upenn.edu/epsd2/cbd/sux/o0023151.html
>>>> ).
>>>>
>>>> Such data does not allow us to create distinct lexical entries for both
>>>> (or, in case of Hittite texts, three) languages that would just go under
>>>> the same lexicog:Entry, because we cannot decide which information (other
>>>> than the possible Sumerian and Akkadian interpretations of the same
>>>> Cuneiform writtenRep) belongs to which lexical entry.
>>>>
>>>> For this reason, we are currently considering having language-agnostic
>>>> lexical entries for a future CDLI glossary (https://cdli.ucla.edu/),
>>>> where language information is provided only at the form (or even, within
>>>> the writtenRep), but not at the lexical entry. Note that there is no
>>>> constraint in the OntoLex core model that requires a single language per
>>>> lexical entry.
>>>>
>>>> What OntoLex says about language is not in the core model, but in Lime:
>>>> "note that all entries in the same lexicon should be in the same language
>>>> and that the language of the lexicon and entry should be consistent with
>>>> the language tags used on all forms". This is a comment (in parentheses, in
>>>> accompanying text, and if assumed to be relevant for the definition of
>>>> ontolex:LexicalEntry, in the wrong place), formulated as a recommendation
>>>> and not part of any definition.
>>>>
>>>> If we consider this statement to be nevertheless binding, the CDLI
>>>> solution would be to create a dictionary with senses and lexicog:Entry
>>>> instances, but without ontolex:LexicalEntry instances. I would prefer not
>>>> to. (I would still prefer to
>>>> avoid multilingual lexical entries in cases in which language-specific
>>>> information is provided, and thus to keep the recommendation in place, as
>>>> is, but this is not the case here.)
>>>>
>>>> Best,
>>>> Christian
>>>>
>>>

-- 
Francis Bond <https://fcbond.github.io/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
Received on Thursday, 6 January 2022 01:39:12 UTC