- From: Christian Chiarcos <christian.chiarcos@web.de>
- Date: Fri, 07 Aug 2020 16:27:29 +0200
- To: "'Felix Sasaki'" <felix@sasakiatcf.com>, "Ronan Power" <Ronan@translation.ie>
- Cc: "santhosh.thottingal@gmail.com" <santhosh.thottingal@gmail.com>, open-linguistics <open-linguistics@googlegroups.com>, "Linked Data for Language Technology Community Group" <public-ld4lt@w3.org>, "public-ontolex@w3.org" <public-ontolex@w3.org>
- Message-ID: <op.0ozg332hbr5td5@kitaba>
On .08.2020, at 15:30, Ronan Power <Ronan@translation.ie> wrote:

> Hi, I wrote on this before to the group:
>
> I think it’s important to realise that ISO639-3 does indeed have its
> problems, not least of which is the “apparent” descriptor<>tag mismatch,
> as do the alternatives and variants, and it is confusing.

Yes, I think most linguists who work on non-major languages have encountered such problems (if they tried to make the language explicit, that is). Yet, for the moment, ISO 639 is extremely important in that it is an inventory that is agreed upon. Despite its flaws, reaching agreement on another system would be a massive undertaking, if possible at all.

> This really boils down to the creation and agreement of a source index
> of identifiers for languages, dialects, written languages and scripts,
> of which to my knowledge no such system has yet been completed
> thoroughly.

The closest thing to that is Glottolog, and it does a good job on minority languages, but not so much on historical languages.
Actually, one nice thing about BCP47 is that it allows custom language tags, and I recently found myself creating such tags by combining the closest BCP47 language tag with the actual Glottocode, e.g.,

"как бази"@av-x-ancu1238

for the Ancux dialect (Glottocode ancu1238) of Avar (ISO 639-3 ava, ISO 639-1 av), but

"кокази"@av

for "standard" Avar (Glottocode avar1256, ISO 639-3 ava, ISO 639-1 av).

And for languages for which no ISO 639 code can be found (e.g., Okinawan, because this is not a dialect of Japanese in Glottolog but a sibling language in the Japonic language family), the placeholder tag "mis" (uncoded) can be used, i.e.,

mis-x-okin1244

This is nice insofar as it provides a BCP47 code for every Glottolog language variety without information loss (because we can retrieve the ISO 639-3 code for every Glottolog languoid by finding the closest parent node that has an ISO 639-3 code attached, and from there, we can find the ISO 639-1 codes using SIL conversion tables or lexvo). And not only does that approach use conventional BCP47 tags wherever possible, but the custom extension with Glottolog actually yields valid BCP47 tags, too (after -x- you can add any private-use subtags of up to eight alphanumeric characters each, and a Glottocode fits). Moreover, it is possible to resolve this to a URI (because all Glottocodes do, https://glottolog.org/resource/languoid/id/ancu1238, and we can retrieve the Glottocode for every ISO 639-3 code from Glottolog [and, via SIL conversion tables, from ISO 639-1 codes]).
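To make the tagging scheme above concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not a reference implementation: the function names and the three lookup tables are hypothetical, the tables hold only the examples from this message (real data would come from Glottolog and the SIL conversion tables), and the direct parent link from ancu1238 to avar1256 is a simplification, since the real Glottolog tree may have intermediate nodes.

```python
import re

# A Glottocode is four alphanumeric characters followed by four digits.
GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")
GLOTTOLOG_URI = "https://glottolog.org/resource/languoid/id/"

# Toy lookup tables (hypothetical, for illustration only).
PARENT = {"ancu1238": "avar1256"}   # languoid -> parent languoid
ISO_639_3 = {"avar1256": "ava"}     # languoid -> ISO 639-3 code
ISO_639_1 = {"ava": "av"}           # ISO 639-3 -> ISO 639-1 code


def closest_iso(glottocode):
    """Walk up the tree to the closest node that has an ISO 639-3 code."""
    node = glottocode
    while node is not None:
        if node in ISO_639_3:
            return ISO_639_3[node]
        node = PARENT.get(node)
    return None


def bcp47_tag(glottocode):
    """Build a BCP47 tag that keeps the Glottocode as a private-use subtag."""
    if not GLOTTOCODE.match(glottocode):
        raise ValueError(f"not a Glottocode: {glottocode!r}")
    iso3 = closest_iso(glottocode)
    if iso3 is None:
        # No ISO 639 code anywhere up the tree: use "mis" (uncoded).
        return f"mis-x-{glottocode}"
    primary = ISO_639_1.get(iso3, iso3)  # prefer the two-letter code
    if glottocode in ISO_639_3:
        # The languoid itself carries the ISO code: the plain tag suffices.
        return primary
    return f"{primary}-x-{glottocode}"


def glottocode_from_tag(tag):
    """Recover the Glottocode from a tag built by bcp47_tag, if present."""
    _, sep, private = tag.partition("-x-")
    if sep and GLOTTOCODE.match(private):
        return private
    return None


def glottolog_uri(glottocode):
    """Reconstruct the Glottolog URI for a Glottocode."""
    return GLOTTOLOG_URI + glottocode


print(bcp47_tag("ancu1238"))
print(bcp47_tag("avar1256"))
print(bcp47_tag("okin1244"))
print(glottolog_uri(glottocode_from_tag("av-x-ancu1238")))
```

Note that glottocode_from_tag only guesses that a private-use subtag is a Glottocode from its shape; nothing in the tag itself asserts this.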
The remaining difficulties are that (a) Glottolog is far from perfect either [for historical languages, the Glottolog classification tries to harmonize diachronic and synchronic relations, and this does not always lead to a consistent result], (b) this is a hack rather than a solution, because there is no formal way to assert that the elements following -x- are Glottocodes in BCP47, and (c) if we know that something is a Glottocode, we can *reconstruct* its URI and browse the Glottolog classification, but there is no good way to make this information explicit.

A nicer solution, therefore, would be to just link an entry with the URIs for ISO language code, Glottocode and other metadata, so that all information is explicit ;) Having standard URIs for ISO 639 tags would be the first step.

Best,
Christian
Received on Friday, 7 August 2020 14:27:49 UTC