- From: Christian Chiarcos <christian.chiarcos@gmail.com>
- Date: Thu, 6 Jan 2022 06:36:51 +0100
- To: Francis Bond <bond@ieee.org>
- Cc: Jorge Gracia del Río <jogracia@unizar.es>, public-ontolex <public-ontolex@w3.org>
- Message-ID: <CAC1YGdgJ7RzdXiws5ZftEpdUt_4NmP5eDDPE0b+o9yMangi9ZQ@mail.gmail.com>
Hi Francis,

On Thu, 6 Jan 2022 at 02:38, Francis Bond <bond@ieee.org> wrote:

> Can you instead model languages hierarchically?

An interesting thought. The OntoLex spec says we should resort to BCP47 or (ISO 639 URIs in) lexvo, so, technically, this is limited. However, ISO639-3 has macro-languages, so for your case you could use msa [Malay macro] along with mhp [Balinese Malay], etc. BCP47 has a more flexible mechanism, with private tags after -x-, so you could use the language tag mis-x-however-you-like-to-call-it for something not in ISO639. In both scenarios you end up with a flat list of language tags, not with a hierarchy, but of course, the private tag can just refer to another ontology that may define a hierarchy (I used Glottolog IDs [not URIs] in this way). Although this sounds like a solution, it is not a good one, because RDF semantics ignore everything after the primary tag, so this really means the same as using mis (unclassified language). Of course we can produce a mis lexicon, or just not identify the language in the first place.

In my scenario, I could use the BCP47 language tag mis-Xsux (unclassifiable language in Cuneiform), but this would not be quite correct, as individual forms may indeed be identifiable as either Sumerian or Akkadian (from their inflection -- the base form wouldn't have that), and if an entry features both Sumerian and Akkadian forms, it is not "unclassified", but *both* Sumerian and Akkadian. Also, we would certainly want to give a Latin transcription, so the forms could have not only mis-Xsux, but also sux-Latn and akk-Latn tags, and if the entry itself is defined as being ...-Xsux, something ...-Latn would be, well, unexpected.

> So say for a word we don't know if it is Malay or Indonesian, we mark it as Malay_family or whatever name we chose. Then these entries can have a language, just that it is underspecified, ...

Yes, that situation is similar to ours.

> Note that the supertype does not have to be related genealogically, especially in the cases where the same script may be used for multiple language families.

This is exactly our situation. Another case could be that of a dictionary of Kanbun literature. This is effectively Chinese (BCP47 zh?), but written by Japanese and to some extent encoding features of Japanese (BCP47 ja-Han?). I would assume that certain forms or expressions are more in line with Chinese on the one hand or more with Japanese on the other and that it would be desirable to make the difference explicit, but also that the same expression can occur in a more clearly Japanese or a more clearly Chinese context (e.g., indicated by word order). I would also assume that short Kanbun texts can be hard to classify for whether they represent Chinese or Japanese at all. Nevertheless, a Kanbun glossary would deal with a well-defined domain, and artificially splitting that into a Japanese and a Chinese subset seems unnatural, to say the least.

Best,
Christian

> On Wed, Jan 5, 2022 at 11:35 PM Christian Chiarcos <christian.chiarcos@gmail.com> wrote:
>
>> Dear Jorge,
>>
>> thanks for the suggestion. Of course that would work from a modelling perspective, but the problem is that in many cases we just don't know what the language is, and it could be either Sumerian or Akkadian and even have different readings (i.e., Latin renderings) for the same signs. For a frequent word like a unit of weight (as in the example), this clearly applies to both languages, but in other cases we risk creating ghost entries instead of providing the language only in cases where we are certain about the language.
>>
>> For this particular case (etymological dictionaries are different), the problem is not so much that the entries are multilingual, but that the defining criterion for what enters our dictionary is not the language, but the writing system, time and provenance of the writing.
>> At times, we don't know the language, and for languages with ideographic writing systems, this can occur regularly. There are, indeed, entire writing systems that are not language-specific and for whose texts we cannot really tell what the language was (e.g., https://en.wikipedia.org/wiki/Zapotec_script, whose tendency towards abandoning syllabic characters seems to be motivated by its spread to foreign speaker communities; the linguistic identification of the entire Teotihuacano writing is very uncertain, cf. https://www.mesoweb.com/bearc/caa/AA01.pdf, and also early Sumerian writing is fully pictorial, so we cannot ascertain its actual language and only speculate that it was Sumerian, e.g., for the Kish tablet, https://en.wikipedia.org/wiki/Kish_tablet -- and this has been debated).
>>
>> The practical problem is that we need to duplicate large parts of our dictionary, and in particular, this pertains to the attestations (all occurrences in the corpus should be linked). For a sample window of 100 years (2100-2000 BCE), we are talking about a corpus of about 3 million tokens where the problem of multilinguality is particularly prevalent, and if no automated disambiguation can be performed, we might end up linking each token twice. With the current FrAC vocabulary, that would mean creating some 7.5 million additional triples (5 triples per attestation, for 3 million tokens) simply for the luxury of having two lexical entries. We could link the attestations to the lexical concept, but in fact, we need to link them with a particular form, not with a particular meaning. (So we need resolvable ontolex:Forms.)
>> I am not sure whether the same form should occur with different lexical entries (this seems counter-intuitive, but is not formally required, depending on the generic or specific reading of the determiner in "one grammatical realization of *a* lexical entry."), but these need to be duplicated then, too. In fact, using one ontolex:Form with multiple lexical entries (i.e., the same entry for different languages) could be another solution to this problem.
>>
>> We will have the same problem for pictograms at some point. We certainly do for things like road signs and emoticons, which differ in form and function over certain areas (think of the use of stop signs in EU vs. US all-way stops), but these areas do not overlap with particular languages -- and it is still possible (and there seems to be a need) to create machine-readable dictionaries for them: https://github.com/nikukyugamer/kaomojitoka-to-google-ime-dictionary.
>>
>> For this reason, because of its obscure way of introduction (i.e., not at LexicalEntry but in Lime), and because it is actually not part of any definition, but just mentioned in accompanying text, I am wondering whether OntoLex is actually supposed to have a single language constraint. I think it is clear that there must be a preference to have that (which is why lime says "should", not "must"), and that this should be formulated more explicitly in the core module. But also, I have a feeling that in the context of diachrony, multilingual terminology and multimedia the existence of cross-linguistic lexical entries will be a recurring question, so if any deviations or refinements of OntoLex core properties, e.g., in designated subclasses, become necessary, it would be good to refer to that line in the documentation.
>>
>> I suggest that we *decide* on one of the following additions to OntoLex core:
>> (a) stricter definition: "A lexical entry can define its language using the properties lime:language or dct:language (see Metadata module). It is recommended to create different lexical entries for different languages."
>> (b) broader definition: "A lexical entry can define one or multiple languages using the properties lime:language or dct:language (see Metadata module)."
>> [insert right after "A Lexical Entry thus needs to be associated with at least one form, and has at most one canonical form (see below)."]
>>
>> This is a clarification for the following passages from Lime:
>> "note that all entries in the same lexicon should be in the same language" (which does not say what happens if the same lexical entry occurs in multiple lexicons -- actually, this doesn't seem to be ruled out by Lime).
>> "The language property indicates the language of a lexicon, a lexical entry, a concept set or a lexicalization set." (whether this says anything about cardinality constraints depends on the generic or exhaustive interpretation of the determiner, so this is ambiguous)
>> "Beyond using the lime:language property, which has a Literal as a range, it is recommended to use the Dublin Core language property"
>>
>> Independently of what we eventually decide, it makes sense to put a note on the language property into OntoLex core because the property occurs in the diagram and examples, but not in the text.
>>
>> From the feedback I got so far I expect a general preference for (a), so this seems to be the default assumption.
>> Personally, I am more in favor of the broader definition (b) because it does not invalidate any resources created in accordance with (a), because it is consistent with our earlier use for multilingual and etymological databases (which (a) is not), because we arrive at a more compact modelling and because it minimizes the dependency on non-core modules (which will make data less comprehensible for future users). Maybe others can give some feedback here.
>>
>> Best,
>> Christian
>>
>> On Wed, 5 Jan 2022 at 10:38, Jorge Gracia del Río <jogracia@unizar.es> wrote:
>>
>>> Dear Christian,
>>>
>>> What about this other approximation? That is, creating a "language-agnostic" lexicog:Entry per known record in the dictionary, and then instantiating lexical entries to account for the language-specific information:
>>>
>>> :sze_concept a ontolex:LexicalConcept ;
>>>     skos:definition "unit of weight, approx 0.04 g" .
>>>
>>> :sze_sux a ontolex:LexicalEntry ;
>>>     ontolex:canonicalForm [
>>>         ontolex:writtenRep "𒊺"@sux-Xsux ;
>>>         ontolex:writtenRep "sze"@sux-Latn
>>>     ] .
>>>
>>> :sze_akk a ontolex:LexicalEntry ;
>>>     ontolex:canonicalForm [
>>>         ontolex:writtenRep "𒊺"@akk-Xsux ;
>>>         ontolex:writtenRep "uţţatu"@akk-Latn
>>>     ] .
>>>
>>> :sze_concept ontolex:isEvokedBy :sze_sux, :sze_akk .
>>>
>>> :sze_entry a lexicog:Entry ;
>>>     lexicog:describes :sze_sux, :sze_akk .
>>>
>>> Best regards,
>>>
>>> Jorge
>>>
>>> On Wed, 8 Dec 2021 at 14:54, Christian Chiarcos <christian.chiarcos@gmail.com> wrote:
>>>
>>>> Dear all,
>>>>
>>>> just for clarification, the following is what I would like to do:
>>>>
>>>> :sze_le a ontolex:LexicalEntry;
>>>>     ontolex:canonicalForm [
>>>>         ontolex:writtenRep "𒊺"; # or: ontolex:writtenRep "𒊺"@sux-Xsux, ontolex:writtenRep "𒊺"@akk-Xsux
>>>>         ontolex:writtenRep "sze"; # transliteration
>>>>         ontolex:writtenRep "sze"@sux-Latn; # transcription
>>>>         ontolex:writtenRep "uţţatu"@akk-Latn # transcription
>>>>     ]; ontolex:sense [ rdfs:comment "unit of weight, approx 0.04 g" ].
>>>>
>>>> The alternative with lexicog:Entry (and without duplicating LexicalEntries) would be
>>>>
>>>> :sze_le a lexicog:Entry;
>>>>     lexicog:describes [ a ontolex:Form;
>>>>         ontolex:writtenRep "𒊺";
>>>>         ontolex:writtenRep "sze"; # transliteration
>>>>         ontolex:writtenRep "sze"@sux; # transcription
>>>>         ontolex:writtenRep "uţţatu"@akk # transcription ... IMHO different language tags should be unproblematic for forms
>>>>     ]; lexicog:describes [ a ontolex:LexicalSense; rdfs:comment "unit of weight, approx 0.04 g"].
>>>>
>>>> The latter way of modelling should be in line with the documentation, but it makes large parts of OntoLex-Lemon redundant and others (e.g., canonicalForm) inapplicable; I would prefer to avoid that.
>>>>
>>>> Best,
>>>> Christian
>>>>
>>>> On Tue, 7 Dec 2021 at 16:32, Christian Chiarcos <christian.chiarcos@gmail.com> wrote:
>>>>
>>>>> Dear all,
>>>>>
>>>>> for different use cases, I came across the need to provide one lexical entry for multiple languages.
>>>>>
>>>>> In one group of cases (esp., etymological dictionaries), this can be circumvented by using lexicog:Entry, instead, and then point to language-specific lexical entries.
>>>>> (Though this is very inelegant, unnecessarily verbose and clearly a departure from/obfuscation of the original structure of the lexical resource, technically it is a possibility.)
>>>>>
>>>>> However, in another case (dictionaries/glossaries for cuneiform languages), we have the problem that we cannot always tell what language a text (and thus, a word) is in. This is because of the multilingual situation of Sumerian and Akkadian during the 3rd millennium BC, because of the use of ideographic signs, because of the laziness of scribes, who often wrote just the stem of a word rather than its morphemes, and because of the habit of Akkadian and Hittite scribes to just write Sumerian (or Akkadian) words instead of their native tongue because these were more established in the writing tradition. Although there are phonological or morphological complements that can reveal the language, these are not systematically used, so that we have uncertainties about the language of individual words or even entire texts. However, if these texts form the basis for a glossary or dictionary, these uncertainties percolate to the glossary, especially if it is corpus-based. The Electronic Penn Sumerian dictionary thus does not distinguish Sumerian and Akkadian forms; it groups everything under the same head word and provides Sumerian and Akkadian readings of the same sign. (The selection of texts is such that a Sumerian reading is more likely, but it is not always necessary.) In some cases in this dictionary, it is even marked that there are doubts that a word is Sumerian in the first place (http://oracc.museum.upenn.edu/epsd2/cbd/sux/o0023151.html).
>>>>>
>>>>> Such data do not allow us to create distinct lexical entries for both (or, in the case of Hittite texts, three) languages that would just go under the same lexicog:Entry, because we cannot decide which information (other than the possible Sumerian and Akkadian interpretations of the same Cuneiform writtenRep) belongs to which lexical entry.
>>>>>
>>>>> For this reason, we are currently considering having language-agnostic lexical entries for a future CDLI glossary (https://cdli.ucla.edu/), where language information is provided only at the form (or even within the writtenRep), but not at the lexical entry. Note that there is no constraint in the OntoLex core model that requires a single language per lexical entry.
>>>>>
>>>>> What OntoLex says about language is not in the core model, but in Lime: "note that all entries in the same lexicon should be in the same language and that the language of the lexicon and entry should be consistent with the language tags used on all forms". This is a comment (in parentheses, in accompanying text, and if assumed to be relevant for the definition of ontolex:LexicalEntry, in the wrong place), formulated as a recommendation and not part of any definition.
>>>>>
>>>>> If we consider this statement to be nevertheless binding, the CDLI solution would be to create a dictionary with senses and lexicog:Entry instances, but without ontolex:LexicalEntry instances. I would prefer not to. (I would still prefer to avoid multilingual lexical entries in cases in which language-specific information is provided, and thus to keep the recommendation in place, as is, but this is not the case here.)
>>>>>
>>>>> Best,
>>>>> Christian
>
> --
> Francis Bond <https://fcbond.github.io/>
> Division of Linguistics and Multilingual Studies
> Nanyang Technological University
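
For illustration, the broader definition (b) discussed in this thread might look roughly as follows in Turtle. This is only a sketch under assumptions: the example.org namespace and the :sze_form node are hypothetical names introduced here, and the lexvo URIs assume the usual ISO 639-3 identifier pattern. The entry name and the written representations are taken from the examples quoted above.

```turtle
@prefix ontolex: <http://www.w3.org/ns/lemon/ontolex#> .
@prefix lime:    <http://www.w3.org/ns/lemon/lime#> .
@prefix dct:     <http://purl.org/dc/terms/> .
@prefix :        <http://example.org/lexicon#> .   # hypothetical namespace

# One lexical entry that declares two languages, as option (b) would permit.
:sze_le a ontolex:LexicalEntry ;
    lime:language "sux", "akk" ;                       # literal language codes
    dct:language <http://lexvo.org/id/iso639-3/sux>,   # assumed lexvo URI pattern
                 <http://lexvo.org/id/iso639-3/akk> ;
    ontolex:canonicalForm :sze_form .

# A named (resolvable) form, so that attestations could point at it directly.
# Language-specific information stays on the written representations.
:sze_form a ontolex:Form ;
    ontolex:writtenRep "𒊺"@sux-Xsux, "𒊺"@akk-Xsux ,
                       "sze"@sux-Latn, "uţţatu"@akk-Latn .
```

Under the stricter definition (a), the same data would instead be split into one lexical entry per language, along the lines of Jorge's proposal quoted above.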
Received on Thursday, 6 January 2022 05:37:18 UTC