Re: Question on language tags and directionality metadata from Christian Chiarcos on 2020-04-06 (public-ontolex@w3.org from April 2020)

From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Date: Mon, 06 Apr 2020 15:56:58 +0200
To: public-ontolex@w3.org, "Felix Sasaki" <felix@sasakiatcf.com>
Cc: "christian.chiarcos@gmail.com" <christian.chiarcos@gmail.com>
Message-ID: <op.0inno8ko89jat0@kitaba>

Hi Felix,

On .04.2020, at 07:25, Felix Sasaki <felix@sasakiatcf.com> wrote:

> I am still involved in W3C, in the internationalization activity. Here
> recently a question came up on BCP 47, the IETF standard for language
> tags including the related sub tag registry, and RDF approaches to
> represent information about language.
>
> In RDF, of course you can use BCP 47 language tags for literals, but
> there are valuable resources like Lexvo that identify languages via
> URIs. Often these resources are based on ISO standards and have no
> direct relation to BCP 47. This leads also to fragmentation, for example
> since BCP 47 includes sub tags that are not part of a given ISO
> standard for languages or regions.
>
> In this context, I have a few questions:
>
> 1) Do you know of any best practices & use cases for using URIs (from
> Lexvo or other sources) in an RDF context? By "using" I mean using the
> URIs to identify the language of a (sub part of an) RDF graph.

This is frequently the case when working with under-resourced or historical
languages. The granularity of BCP 47 and/or ISO 639 is simply insufficient,
and the categories are too imprecise, for many applications in linguistics.
For low-resource languages and fine-grained classification of language
varieties, Glottolog is relatively widely used. ISO 639-6, which could have
been applied here, has been withdrawn. For historical languages, nothing
comparable exists (and the diachronic dimension in Glottolog is
unsatisfactory). In multilingual datasets, people may decide to use
URI-based encoding throughout for the sake of consistency (many of the
languages have no language tag). Another reason for using URIs is that
they are [or at least, can be] defined and verified, whereas some ISO 639
labels can be interpreted differently. For example, what is the difference
between gmh and de? Traditionally, Middle High German extended from the
11th to the 15th c.; nowadays many people prefer 11th to 14th c. So,
without a more detailed definition than the one provided by ISO 639/BCP 47,
people will disagree on whether something is gmh or de.

When using URIs, most people will probably prefer ISO 639-3 because it is
more established than Glottolog and linguistically more fine-grained than
ISO 639-1 and ISO 639-2 (and it doesn't need to follow the complex
composition and selection rules of BCP 47). The Library of Congress
provides ISO 639-1 and ISO 639-2 URIs, but SIL (for ISO 639-3) does not
(AFAIK). This is why we normally go for Lexvo, although it's a bit behind
ISO 639-3.
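
To make the URI-based approach a bit more concrete, here is a minimal
Python sketch that builds a language URI from an ISO 639 code and prints
an N-Triples statement attaching it to a resource. The URI patterns are
the ones I understand Lexvo and id.loc.gov to use; the example.org
resource, the helper names and the choice of dct:language as the linking
property are assumptions made for illustration only.

```python
# Sketch: identify the language of a resource by URI instead of a BCP 47 tag.
# Assumptions: the namespace table, the helper names and dct:language as the
# linking property are illustrative choices, not prescribed by any standard.

BASES = {
    "iso639-1": "http://id.loc.gov/vocabulary/iso639-1/",  # Library of Congress
    "iso639-2": "http://id.loc.gov/vocabulary/iso639-2/",  # Library of Congress
    "iso639-3": "http://lexvo.org/id/iso639-3/",           # Lexvo (SIL has no URIs)
}
CODE_LENGTH = {"iso639-1": 2, "iso639-2": 3, "iso639-3": 3}

DCT_LANGUAGE = "http://purl.org/dc/terms/language"


def language_uri(code: str, scheme: str = "iso639-3") -> str:
    """Return a language URI for an ISO 639 code in the given scheme."""
    if scheme not in BASES:
        raise ValueError(f"unknown scheme: {scheme!r}")
    code = code.strip().lower()
    well_formed = code.isascii() and code.isalpha()
    if not well_formed or len(code) != CODE_LENGTH[scheme]:
        raise ValueError(f"{code!r} is not a well-formed {scheme} code")
    # Note: this checks only the shape of the code, not whether it is assigned.
    return BASES[scheme] + code


def language_triple(resource_uri: str, code: str, scheme: str = "iso639-3") -> str:
    """Return one N-Triples line linking a resource to its language URI."""
    return f"<{resource_uri}> <{DCT_LANGUAGE}> <{language_uri(code, scheme)}> ."


if __name__ == "__main__":
    # Hypothetical resource: a Middle High German text.
    print(language_triple("http://example.org/text/1", "gmh"))
```

The point of the indirection is that the URI can be dereferenced and its
definition checked, which a bare "gmh" tag does not allow.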

> 2) Are there any recommendations like: "here use URIs, here use BCP 47"?
> For what I found, the main use case of URIs is to express information
> *about* languages as first-class objects, but not to attach language
> information to other parts of an RDF graph - see 1) above.
>
> 3) In addition to language, there are other types of metadata needed in an
> i18n context, e.g. metadata about directionality of strings. Do you know
> about best practices for representing such metadata in RDF?
>
> 4) In an "identify language via URIs" approach, how would one identify
> the entries of the BCP 47 sub tag registry that do not have a URI?

Provide URIs via the registry.

> 5) Is there an authority for language related URIs?

ISO 639-1: Library of Congress
ISO 639-2: Library of Congress
ISO 639-3: SIL (not URIs, though)
Beyond that, everything is a matter of discussion and perspective, but  
Glottolog is a good starting point.

NB: There was a discussion about revising language tags in an RDF context  
(https://github.com/w3c/EasierRDF/issues/22). My personal preference would  
be to permit URIs *as* language tags and to interpret every language tag  
as a URI in a specific namespace (unless another namespace is declared).  
This may be a little radical, but this way we could keep the current
(Turtle/SPARQL) notation.
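
A rough sketch of what that interpretation rule could look like, in
Python. Nothing here is standardized: the default namespace
http://example.org/bcp47/ and the function name are hypothetical, and
the test for "is already a URI" is deliberately simplistic.

```python
# Sketch of "every language tag is a URI in some namespace".
# Assumption: DEFAULT_LANG_NS is a made-up namespace; no such namespace exists
# for BCP 47 tags today.

DEFAULT_LANG_NS = "http://example.org/bcp47/"


def expand_language_tag(tag: str, namespace: str = DEFAULT_LANG_NS) -> str:
    """Interpret a language tag as a URI.

    A value that already looks like an absolute URI is returned unchanged;
    a plain tag is lowercased (BCP 47 tags are case-insensitive) and
    appended to the namespace in effect.
    """
    tag = tag.strip()
    if not tag:
        raise ValueError("empty language tag")
    if "://" in tag:  # crude check for an absolute URI
        return tag
    return namespace + tag.lower()


if __name__ == "__main__":
    # "text"@de with no declaration: resolved against the default namespace.
    print(expand_language_tag("de"))
    # Same notation, but with another namespace declared for the document.
    print(expand_language_tag("gmh", "http://lexvo.org/id/iso639-3/"))
```

Existing data such as "text"@de would keep working unchanged, while a
dataset that declares a different namespace could use the same notation
for finer-grained identifiers.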

Best,
Christian
-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 11-15, #107
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28334
Received on Monday, 6 April 2020 13:57:16 UTC