
Re: [open-linguistics] Re: ISO 639 URIs

From: Christian Chiarcos <christian.chiarcos@web.de>
Date: Fri, 07 Aug 2020 16:27:29 +0200
To: "'Felix Sasaki'" <felix@sasakiatcf.com>, "Ronan Power" <Ronan@translation.ie>
Cc: "santhosh.thottingal@gmail.com" <santhosh.thottingal@gmail.com>, open-linguistics <open-linguistics@googlegroups.com>, "Linked Data for Language Technology Community Group" <public-ld4lt@w3.org>, "public-ontolex@w3.org" <public-ontolex@w3.org>
Message-ID: <op.0ozg332hbr5td5@kitaba>
On .08.2020, at 15:30, Ronan Power <Ronan@translation.ie> wrote:

>
> Hi, I wrote on this before to the group:
>
> I think it’s important to realise that ISO639-3 does indeed have its
> problems, not least of which is the “apparent” descriptor<>tag mismatch,
> as do the alternatives and variants, and it is confusing.

Yes, I think most linguists who work on non-major languages have  
encountered such problems (if they tried to make the language explicit,  
that is). Yet, for the moment, ISO 639 is extremely important in that it  
is an inventory that is agreed upon. Despite its flaws, reaching agreement  
on another system would be a massive undertaking, if possible at all.

>
> This really boils down to the creation and agreement of a source index
> of identifiers for languages, dialects, written languages and scripts,
> of which to my knowledge no such system has yet been completed
> thoroughly.

The closest thing to that is Glottolog, and it does a good job on minority  
languages, but not so much on historical languages.

Actually, one nice thing about BCP47 is that it allows custom
language tags, and I recently found myself creating such language tags by
combining the closest BCP47 language tag with the actual Glottocode, e.g.,

"как бази"@av-x-ancu1238 for the Ancux dialect (Glottocode ancu1238) of
Avar (ISO 639-3 ava, ISO 639-1 av)

but

"кокази"@av for "standard" Avar (Glottocode avar1256, ISO 639-3 ava, ISO  
639-1 av)

And for languages for which no ISO 639 code can be found (e.g., Okinawan,
because this is not a dialect of Japanese in Glottolog but a sibling
language in the Japonic language family), the placeholder tag "mis"
(uncoded) can be used, i.e., mis-x-okin1244
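
To make the tagging convention above concrete, here is a minimal sketch in Python. The function name make_tag and its parameters are made up for illustration, and the Glottocode pattern (four lowercase alphanumerics followed by four digits) is my reading of the Glottolog convention, so treat both as assumptions rather than an established API.

```python
import re

# Assumed Glottocode shape: four lowercase alphanumerics plus four digits.
GLOTTOCODE = re.compile(r"^[a-z0-9]{4}[0-9]{4}$")


def make_tag(glottocode, base=None, is_standard=False):
    """Build a BCP47 tag that carries a Glottocode as a private-use subtag.

    base        -- closest conventional BCP47 language tag, or None if the
                   variety has no ISO 639 code (then "mis" is used)
    is_standard -- if True and a base tag exists, return the plain base tag
    """
    if not GLOTTOCODE.match(glottocode):
        raise ValueError(f"not a Glottocode: {glottocode!r}")
    if is_standard and base:
        return base
    return f"{base or 'mis'}-x-{glottocode}"


print(make_tag("ancu1238", base="av"))                    # Ancux dialect of Avar
print(make_tag("avar1256", base="av", is_standard=True))  # "standard" Avar
print(make_tag("okin1244"))                               # no ISO 639 code found
```

The three calls reproduce the three cases discussed above: a dialect under a coded language, the coded language itself, and a variety that falls back to "mis".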

This is nice insofar as this approach allows us to provide a BCP47 code for
every Glottolog language variety without information loss (because we can
retrieve the ISO 639-3 code for every Glottolog languoid by finding the
closest parent node that has an ISO 639-3 code attached, and from there,
we can find the ISO 639-1 codes using SIL conversion tables or lexvo).
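
The lookup described above (walk up the Glottolog tree until a node with an ISO 639-3 code is found, then convert to ISO 639-1) can be sketched as follows. The two dictionaries are a toy, hand-written excerpt with a simplified parent link, not a dump of Glottolog or of the SIL tables, and the name closest_iso3 is hypothetical.

```python
# Toy excerpt (assumption): glottocode -> (parent glottocode, ISO 639-3 code).
# The real Glottolog tree has more intermediate nodes than shown here.
LANGUOIDS = {
    "avar1256": (None, "ava"),        # Avar, has its own ISO 639-3 code
    "ancu1238": ("avar1256", None),   # Ancux dialect, no ISO code of its own
}

# Toy excerpt (assumption) of an ISO 639-3 -> ISO 639-1 conversion table.
ISO3_TO_ISO1 = {"ava": "av"}


def closest_iso3(glottocode, languoids=LANGUOIDS):
    """Return the ISO 639-3 code of the languoid or of its nearest ancestor.

    Returns None if no ancestor carries a code. Raises KeyError if the
    glottocode is not in the table.
    """
    seen = set()
    current = glottocode
    while current is not None and current not in seen:
        seen.add(current)  # guard against accidental cycles in the data
        parent, iso3 = languoids[current]
        if iso3 is not None:
            return iso3
        current = parent
    return None


iso3 = closest_iso3("ancu1238")
print(iso3, ISO3_TO_ISO1.get(iso3))
```

With real data, the two tables would be filled from a Glottolog export and from the SIL conversion tables or lexvo.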

And not only does that approach use conventional BCP47 tags wherever
possible, but the custom extension with Glottolog actually yields valid
BCP47 tags, too (after -x- you can add any subtags of up to eight
alphanumeric characters, and a Glottocode is exactly eight). Moreover, it is
possible to resolve this to a URI (because all Glottocodes do, e.g.,
https://glottolog.org/resource/languoid/id/ancu1238, and we can retrieve
the Glottocode for every ISO 639-3 code from Glottolog [and, via SIL
conversion tables, for ISO 639-1 codes]).
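
As a rough illustration of resolving such a tag to a URI, the sketch below takes the first subtag after -x- and appends it to the Glottolog base URI quoted above. It assumes, purely by the convention of this mail, that this subtag is a Glottocode; BCP47 itself gives no such guarantee. The function names are hypothetical.

```python
GLOTTOLOG_BASE = "https://glottolog.org/resource/languoid/id/"


def glottocode_from_tag(tag):
    """Return the first private-use subtag of a BCP47 tag, or None."""
    subtags = tag.lower().split("-")
    if "x" not in subtags:
        return None
    private = subtags[subtags.index("x") + 1:]
    return private[0] if private else None


def glottolog_uri(tag):
    """Reconstruct a Glottolog URI from a tag like 'av-x-ancu1238'.

    Assumption: the private-use subtag is a Glottocode. Tags without -x-
    yield None; they would need the ISO 639-3 -> Glottocode lookup instead.
    """
    code = glottocode_from_tag(tag)
    return GLOTTOLOG_BASE + code if code else None


print(glottolog_uri("av-x-ancu1238"))
print(glottolog_uri("av"))
```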

The remaining difficulties are that
(a) Glottolog is far from perfect either [for historical languages, the  
Glottolog classification tries to harmonize diachronic and synchronic  
relations, and this does not always lead to a consistent result],
(b) this is a hack rather than a solution, because there is no formal way  
to assert that the elements following -x- are Glottocodes in BCP47, and
(c) if we know that something is a Glottocode, we can *reconstruct* its  
URI and browse the Glottolog classification, but there is no good way to  
make this information explicit.

A nicer solution, therefore, would be to just link an entry with the URIs  
for ISO language code, Glottocode and other metadata, so that all  
information is explicit ;) Having standard URIs for ISO 639 tags would be  
the first step.
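
For illustration only, explicit linking could look roughly like the Turtle produced by the sketch below. The entry URI, the choice of dct:language as the linking property, and the use of lexvo URIs as a stand-in for the not-yet-existing standard ISO 639 URIs are all assumptions of mine, as is the function name.

```python
def language_links_turtle(entry_uri, iso3, glottocode):
    """Return a Turtle snippet linking an entry to its language URIs.

    Assumptions: dct:language as the property, lexvo for the ISO 639-3 URI.
    """
    return (
        "@prefix dct: <http://purl.org/dc/terms/> .\n"
        f"<{entry_uri}> dct:language\n"
        f"    <http://lexvo.org/id/iso639-3/{iso3}> ,\n"
        f"    <https://glottolog.org/resource/languoid/id/{glottocode}> .\n"
    )


# http://example.org/entry/1 is a placeholder, not a real resource.
print(language_links_turtle("http://example.org/entry/1", "ava", "ancu1238"))
```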

Best,
Christian
Received on Friday, 7 August 2020 14:27:48 UTC