Re: multilingual lexical entries? from Christian Chiarcos on 2022-01-06 (public-ontolex@w3.org from January 2022)

From: Christian Chiarcos <christian.chiarcos@gmail.com>
Date: Thu, 6 Jan 2022 06:36:51 +0100
To: Francis Bond <bond@ieee.org>
Cc: Jorge Gracia del Río <jogracia@unizar.es>, public-ontolex <public-ontolex@w3.org>
Message-ID: <CAC1YGdgJ7RzdXiws5ZftEpdUt_4NmP5eDDPE0b+o9yMangi9ZQ@mail.gmail.com>

Hi Francis,

On Thu, 6 Jan 2022 at 02:38, Francis Bond <bond@ieee.org> wrote:

> Can you instead model languages hierarchically?
>

An interesting thought. The OntoLex spec says we should resort to BCP47 or
(ISO 639 URIs in) lexvo, so, technically, this is limited. However,
ISO639-3 has macro-languages, so for your case you could use msa [Malay
macro] along with mhp [Balinese Malay], etc. BCP47 has a more flexible
mechanism, with private tags after -x-, so you could use the language tag
mis-x-however-you-like-to-call-it for something not in ISO639. In both
scenarios you end up with a flat list of language tags, not with a
hierarchy, but of course, the private tag can just refer to another
ontology that may define a hierarchy (I used Glottolog IDs [not URIs] in
this way). Although this sounds like a solution, it is not a good one,
because RDF attaches no meaning to anything after the primary tag, so for
generic tools this really means the same as using mis (unclassified
language). Of course we can produce a mis lexicon, or simply not identify
the language in the first place.
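
A rough sketch of the two tagging options in Turtle. The ex: namespace, the
resource names, the placeholder strings and the private-use subtag
"abcd1234" (standing in for a Glottolog ID) are made up for illustration.

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix ex:      <http://example.org/lexicon#> .

# ISO 639-3 codes: msa = Malay (macrolanguage), mhp = Balinese Malay.
# (BCP47 itself registers Malay under the two-letter code ms.)
ex:form_macro a ontolex:Form ;
    ontolex:writtenRep "placeholder"@msa .

ex:form_specific a ontolex:Form ;
    ontolex:writtenRep "placeholder"@mhp .

# BCP47 private-use subtag after -x- (at most 8 characters per subtag);
# "abcd1234" is a hypothetical identifier, not a real Glottolog ID.
ex:form_private a ontolex:Form ;
    ontolex:writtenRep "placeholder"@mis-x-abcd1234 .
```

Either way the tags remain a flat list; any hierarchy has to live in the
external resource that the private-use subtag points to.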

In my scenario, I could use the BCP47 language tag mis-Xsux (unclassifiable
language in Cuneiform), but this would not be quite correct, as individual
forms may indeed be identifiable as either Sumerian or Akkadian (from their
inflection -- the base form wouldn't have that), and if an entry features
both Sumerian and Akkadian forms, it is not "unclassified" but *both*
Sumerian and Akkadian. Also, we would certainly want to give a Latin
transcription, so the forms could have not only mis-Xsux, but also sux-Latn
and akk-Latn tags, and if the entry itself is defined as being ...-Xsux,
something ...-Latn would be, well, unexpected.
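
To make the mismatch concrete, here is a sketch using the forms from my
earlier example. The ex: names are made up, and putting "mis-Xsux" into
lime:language is just the assumption under discussion, not a recommendation.

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lime:    <http://www.w3.org/ns/lemon/lime#> .
@prefix ex:      <http://example.org/lexicon#> .

ex:sze_le a ontolex:LexicalEntry ;
    lime:language "mis-Xsux" ;              # entry-level tag
    ontolex:canonicalForm ex:sze_form .

ex:sze_form a ontolex:Form ;
    ontolex:writtenRep "𒊺"@mis-Xsux ,      # consistent with the entry
                       "sze"@sux-Latn ,     # different language and script
                       "uţţatu"@akk-Latn .  # different language and script
```

The last two literals are exactly the ones that would clash with the Lime
recommendation that entry and form tags be consistent.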

> So say for a word we don't know if it is Malay or Indonesian, we mark it as
> Malay_family or whatever name we choose.  Then these entries can have a
> language, just that it is underspecified, ...
>

Yes, that situation is similar to ours.

> Note that the supertype does not have to be related genealogically,
> especially in the cases where the same script may be used for multiple
> language families.
>

This is exactly our situation.  Another case could be that of a dictionary
of Kanbun literature. This is effectively Chinese (BCP47 zh?), but written
by Japanese and to some extent encoding features of Japanese (BCP47
ja-Hani?). I would assume that certain forms or expressions are more in line
with Chinese on the one hand or more with Japanese on the other and that it
would be desirable to make the difference explicit, but also that the same
expression can occur in a more clearly Japanese or a more clearly Chinese
context (e.g., indicated by word order). I would also assume that short
Kanbun texts can be hard to classify for whether they represent Chinese or
Japanese at all. Nevertheless, a Kanbun glossary would deal with a
well-defined domain, and artificially splitting that into a Japanese and a
Chinese subset seems unnatural, to say the least.

Best,
Christian


>
> On Wed, Jan 5, 2022 at 11:35 PM Christian Chiarcos <
> christian.chiarcos@gmail.com> wrote:
>
>> Dear Jorge,
>>
>> thanks for the suggestion. Of course that would work from a modelling
>> perspective, but the problem is that in many cases we just don't know what
>> the language is, and it could be either Sumerian or Akkadian and even have
>> different readings (i.e., Latin renderings) for the same signs. For a
>> frequent word like a unit of weight (as in the example), this clearly
>> applies to both languages, but in other cases we risk creating ghost
>> entries instead of providing the language only in cases where we are
>> certain about the language.
>>
>> For this particular case (etymological dictionaries are different), the
>> problem is not so much that the entries are multilingual, but that the
>> defining criterion what enters our dictionary is not the language, but the
>> writing system, time and provenance of the writing. At times, we don't know
>> the language, and for languages with ideographic writing systems, this can
>> occur regularly. There are, indeed, entire writing systems that are not
>> language-specific and for whose texts we cannot really tell what the
>> language was (e.g., https://en.wikipedia.org/wiki/Zapotec_script, whose
>> tendency towards abandoning syllabic characters seems to be motivated by
>> its spread to foreign speaker communities; the linguistic identification of
>> the entire Teotihuacano writing is very uncertain, cf.
>> https://www.mesoweb.com/bearc/caa/AA01.pdf, and also early Sumerian
>> writing is fully pictorial, so we cannot ascertain its actual language and
>> only speculate that it was Sumerian, e.g., for the
>> https://en.wikipedia.org/wiki/Kish_tablet, -- and this has been
>> debated).
>>
>> The practical problem is that we need to duplicate large parts of our
>> dictionary, and in particular, this pertains to the attestations (all
>> occurrences in the corpus should be linked). For a sample window of 100
>> years (2100-2000 BCE), we are talking about a corpus of about 3 million
>> tokens where the problem of multilinguality is particularly prevalent, and
>> if no automated disambiguation can be performed, we might end up linking
>> each token twice. With the current FrAC vocabulary, that would mean
>> creating some 15 million additional triples (5 triples per attestation,
>> for 3 million tokens) simply for the luxury of having two lexical
>> entries. We
>> could link the attestations to the lexical concept, but in fact, we need to
>> link them with a particular form, not with a particular meaning. (So we
>> need resolvable ontolex:Forms.) I am not sure whether the same form should
>> occur with different lexical entries (this seems counter-intuitive, but is
>> not formally required, depending on the generic or specific reading of the
>> determiner in "one grammatical realization of *a* lexical entry."), but
>> these need to be duplicated then, too. In fact, using one ontolex:Form with
>> multiple lexical entries (i.e., the same entry for different languages)
>> could be another solution to this problem.
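>>
>> A minimal sketch of that last option, with one ontolex:Form shared by two
>> language-specific entries. The ex: names are made up, and whether OntoLex
>> permits a form to belong to two entries is exactly the open question.
>>
>> ```turtle
>> @prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
>> @prefix ex:      <http://example.org/lexicon#> .
>>
>> # One resolvable form; attestations would point here only once.
>> ex:sze_form a ontolex:Form ;
>>     ontolex:writtenRep "𒊺"@sux-Xsux, "𒊺"@akk-Xsux .
>>
>> ex:sze_sux a ontolex:LexicalEntry ;
>>     ontolex:canonicalForm ex:sze_form .
>>
>> ex:sze_akk a ontolex:LexicalEntry ;
>>     ontolex:canonicalForm ex:sze_form .
>> ```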
>>
>> We will have the same problem for pictograms at some point. We certainly
>> do for things like road signs and emoticons, which differ in form and
>> function over certain areas (think of the use of stop signs in EU vs. US
>> all-way stops), but these areas do not overlap with particular languages --
>> and it is still possible (and there seems to be a need) to create
>> machine-readable dictionaries for them:
>> https://github.com/nikukyugamer/kaomojitoka-to-google-ime-dictionary.
>>
>> For this reason, because of its obscure way of introduction (i.e., not at
>> LexicalEntry but in Lime), and because it is actually not part of any
>> definition, but just mentioned in accompanying text, I am wondering whether
>> OntoLex is actually supposed to have a single language constraint. I think
>> it is clear that there must be a preference to have that (which is why lime
>> says "should", not "must"), and that that should be formulated more
>> explicitly in the core module. But also, I have a feeling that in the
>> context of diachrony, multilingual terminology and multimedia the existence
>> of cross-linguistic lexical entries will be a recurring question, so if any
>> deviations or refinements of OntoLex core properties, e.g., in designated
>> subclasses, would be necessary, it would be good to refer to that line in
>> the documentation.
>>
>> I suggest that we *decide* on one of the following additions to OntoLex core:
>> (a) stricter definition: "A lexical entry can define its language using
>> the properties lime:language or dct:language (see Metadata module). It is
>> recommended to create different lexical entries for different languages."
>> (b) broader definition: "A lexical entry can define one or multiple
>> languages using the properties lime:language or dct:language (see Metadata
>> module)."
>> [insert right after "A Lexical Entry thus needs to be associated with at
>> least one form, and has at most one canonical form (see below)."]
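>>
>> The two options would look roughly as follows. This is a sketch only: the
>> ex: names are made up, and the lexvo URIs assume the usual
>> http://lexvo.org/id/iso639-3/ pattern.
>>
>> ```turtle
>> @prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
>> @prefix lime:    <http://www.w3.org/ns/lemon/lime#> .
>> @prefix dct:     <http://purl.org/dc/terms/> .
>> @prefix ex:      <http://example.org/lexicon#> .
>>
>> # (a) one lexical entry per language
>> ex:sze_sux a ontolex:LexicalEntry ;
>>     lime:language "sux" ;
>>     dct:language <http://lexvo.org/id/iso639-3/sux> .
>>
>> ex:sze_akk a ontolex:LexicalEntry ;
>>     lime:language "akk" ;
>>     dct:language <http://lexvo.org/id/iso639-3/akk> .
>>
>> # (b) one lexical entry with several languages
>> ex:sze_le a ontolex:LexicalEntry ;
>>     lime:language "sux", "akk" ;
>>     dct:language <http://lexvo.org/id/iso639-3/sux> ,
>>                  <http://lexvo.org/id/iso639-3/akk> .
>> ```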
>>
>> This is a clarification for the following passages from Lime:
>> "note that all entries in the same lexicon should be in the same
>> language" (which does not say what happens if the same lexical entry occurs
>> in multiple lexicons -- actually, this doesn't seem to be ruled out by
>> Lime).
>> "The language property indicates the language of a lexicon, a lexical
>> entry, a concept set or a lexicalization set." (whether this says anything
>> about cardinality constraints depends on the generic or exhaustive
>> interpretation of the determiner, so this is ambiguous)
>> "Beyond using the lime:language property, which has a Literal as a range,
>> it is recommended to use the Dublin Core language property"
>>
>> Independently of what we will eventually decide, it makes sense to put
>> a note on the language property into OntoLex core because the property
>> occurs in the diagram and examples, but not in the text.
>>
>> From the feedback I got so far I expect a general preference for (a), so
>> this seems to be the default assumption. Personally, I am more in favor of
>> the broader definition (b) because it does not invalidate any resources
>> created in accordance with (a), because it is consistent with our earlier
>> use for multilingual and etymological databases (which (a) is not),
>> because we arrive at a more compact modelling and because it minimizes
>> the dependency on non-core modules (which will make data less
>> comprehensible for future users). Maybe others can give some feedback here.
>>
>> Best,
>> Christian
>>
>> On Wed, 5 Jan 2022 at 10:38, Jorge Gracia del Río <
>> jogracia@unizar.es> wrote:
>>
>>> Dear Christian,
>>>
>>> What about this other approach? That is, creating a
>>> "language-agnostic" lexicog:Entry per known record in the dictionary, and
>>> then instantiating lexical entries to account for the language-specific
>>> information:
>>>
>>> :sze_concept a ontolex:LexicalConcept;
>>>      skos:definition "unit of weight, approx 0.04 g" .
>>>
>>> :sze_sux a ontolex:LexicalEntry;
>>>     ontolex:canonicalForm [
>>>         ontolex:writtenRep "𒊺"@sux-Xsux;
>>>         ontolex:writtenRep "sze"@sux-Latn
>>>     ] .
>>>
>>> :sze_akk a ontolex:LexicalEntry;
>>>     ontolex:canonicalForm [
>>>        ontolex:writtenRep "𒊺"@akk-Xsux;
>>>        ontolex:writtenRep "uţţatu"@akk-Latn
>>>     ] .
>>>
>>> :sze_concept ontolex:isEvokedBy :sze_sux, :sze_akk .
>>>
>>> :sze_entry a lexicog:Entry ;
>>>      lexicog:describes :sze_sux, :sze_akk .
>>>
>>>
>>> Best regards,
>>>
>>> Jorge
>>>
>>> On Wed, 8 Dec 2021 at 14:54, Christian Chiarcos (<
>>> christian.chiarcos@gmail.com>) wrote:
>>>
>>>> Dear all,
>>>>
>>>> just for clarification, the following is what I would like to do:
>>>>
>>>> :sze_le a ontolex:LexicalEntry;
>>>>     ontolex:canonicalForm [
>>>>         # or: ontolex:writtenRep "𒊺"@sux-Xsux, "𒊺"@akk-Xsux
>>>>         ontolex:writtenRep "𒊺";
>>>>         ontolex:writtenRep "sze"; # transliteration
>>>>         ontolex:writtenRep "sze"@sux-Latn; # transcription
>>>>         ontolex:writtenRep "uţţatu"@akk-Latn # transcription
>>>>     ];
>>>>     ontolex:sense [ rdfs:comment "unit of weight, approx 0.04 g" ].
>>>>
>>>> The alternative with lexicog:Entry (and without duplicating
>>>> LexicalEntries) would be
>>>>
>>>> :sze_le a lexicog:Entry;
>>>>     lexicog:describes [ a ontolex:Form;
>>>>         ontolex:writtenRep "𒊺";
>>>>         ontolex:writtenRep "sze"; # transliteration
>>>>         ontolex:writtenRep "sze"@sux; # transcription
>>>>         # IMHO different language tags should be unproblematic for forms
>>>>         ontolex:writtenRep "uţţatu"@akk # transcription
>>>>     ];
>>>>     lexicog:describes [ a ontolex:LexicalSense;
>>>>         rdfs:comment "unit of weight, approx 0.04 g" ].
>>>>
>>>> The latter way of modelling should be in line with the documentation,
>>>> but it makes large parts of OntoLex-Lemon redundant and others (e.g.,
>>>> canonicalForm) inapplicable. I would prefer to avoid that.
>>>>
>>>> Best,
>>>> Christian
>>>>
>>>> On Tue, 7 Dec 2021 at 16:32, Christian Chiarcos <
>>>> christian.chiarcos@gmail.com> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> for different use cases, I came across the need to provide one lexical
>>>>> entry for multiple languages.
>>>>>
>>>>> In one group of cases (esp., etymological dictionaries), this can be
>>>>> circumvented by using lexicog:Entry instead, and then pointing to
>>>>> language-specific lexical entries. (This is very inelegant,
>>>>> unnecessarily verbose and clearly a departure from/obfuscation of the
>>>>> original structure of the lexical resource, but technically, it is a
>>>>> possibility.)
>>>>>
>>>>> However, in another case (dictionaries/glossaries for cuneiform
>>>>> languages), we have the problem that we cannot always tell what language a
>>>>> text (and thus, a word) is in. This is because of the multilingual
>>>>> situation of Sumerian and Akkadian during the 3rd millennium BC, because
>>>>> of the use of ideographic signs, because scribes often did not bother to
>>>>> write morphemes, but just the stem of a word, and because of the habit of
>>>>> Akkadian and Hittite scribes to just write Sumerian (or Akkadian) words
>>>>> instead of their native tongue because these were more established in the
>>>>> writing tradition. Although there are phonological or morphological
>>>>> complements that can reveal the language, these are not systematically
>>>>> used, so that we have uncertainties about the language of individual words
>>>>> or even entire texts. However, if these texts form the basis for a glossary
>>>>> or dictionary, these uncertainties percolate to the glossary, especially if
>>>>> it is corpus-based. The electronic Pennsylvania Sumerian Dictionary thus
>>>>> does not distinguish Sumerian and Akkadian forms; it groups everything
>>>>> under the same head word and just provides Sumerian and Akkadian readings
>>>>> of the same sign. (The selection of texts is such that a Sumerian reading
>>>>> is more likely, but it is not always necessary.) In some cases in this
>>>>> dictionary, it is even marked that there are doubts that a word is Sumerian
>>>>> in the first place (http://oracc.museum.upenn.edu/epsd2/cbd/sux/o0023151.html
>>>>> ).
>>>>>
>>>>> Such data does not allow us to create distinct lexical entries for both
>>>>> (or, in case of Hittite texts, three) languages that would just go under
>>>>> the same lexicog:Entry, because we cannot decide which information (other
>>>>> than the possible Sumerian and Akkadian interpretations of the same
>>>>> Cuneiform writtenRep) belongs to which lexical entry.
>>>>>
>>>>> For this reason, we are currently considering having
>>>>> language-agnostic lexical entries for a future CDLI glossary (
>>>>> https://cdli.ucla.edu/), where language information is provided only
>>>>> at the form (or even, within the writtenRep), but not at the lexical entry.
>>>>> Note that there is no constraint in the OntoLex core model that requires a
>>>>> single language per lexical entry.
>>>>>
>>>>> What OntoLex says about language is not in the core model, but in
>>>>> Lime: "note that all entries in the same lexicon should be in the same
>>>>> language and that the language of the lexicon and entry should be
>>>>> consistent with the language tags used on all forms". This is a comment (in
>>>>> parentheses, in accompanying text, and if assumed to be relevant for the
>>>>> definition of ontolex:LexicalEntry, in the wrong place), formulated as a
>>>>> recommendation and not part of any definition.
>>>>>
>>>>> If we consider this statement to be nevertheless binding, the CDLI
>>>>> solution would be to create a dictionary with senses and lexicog:Entry
>>>>> instances, but without ontolex:LexicalEntry instances. I would prefer not
>>>>> to. (I would still prefer to
>>>>> avoid multilingual lexical entries in cases in which language-specific
>>>>> information is provided, and thus to keep the recommendation in place, as
>>>>> is, but this is not the case here.)
>>>>>
>>>>> Best,
>>>>> Christian
>>>>>
>>>>
>
> --
> Francis Bond <https://fcbond.github.io/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
>
Received on Thursday, 6 January 2022 05:37:18 UTC