Re: Question on language tags and directionality metadata from Felix Sasaki on 2020-04-07 (public-ontolex@w3.org from April 2020)

From: Felix Sasaki <felix@sasakiatcf.com>
Date: Tue, 7 Apr 2020 16:05:46 +0200
To: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>
Cc: public-ontolex@w3.org, "christian.chiarcos@gmail.com" <christian.chiarcos@gmail.com>
Message-ID: <CAL58czrhkxXOWYMjUV9==56GEvSDMt=iagawu5=aEpn1N1z2mw@mail.gmail.com>
Thanks a lot, Christian, very helpful. Some comments below.

On Mon, 6 Apr 2020 at 15:56, Christian Chiarcos <
chiarcos@informatik.uni-frankfurt.de> wrote:

> Hi Felix,
>
> On .04.2020, 07:25, Felix Sasaki <felix@sasakiatcf.com> wrote:
>
> I am still involved in W3C, in the internationalization activity. A
> question recently came up there on BCP 47, the IETF standard for language
> tags (including the related subtag registry), and on RDF approaches to
> representing information about language.
>
> In RDF, of course you can use BCP 47 language tags for literals, but there
> are valuable resources like Lexvo that identify languages via URIs. Often
> these resources are based on ISO standards and have no direct relation to
> BCP 47. This also leads to fragmentation, for example since BCP 47 includes
> subtags that are not part of a given ISO standard for languages or regions.
>
> In this context, I have a few questions:
>
> 1) Do you know of any best practices & use cases for using URIs (from
> Lexvo or other sources) in an RDF context? By "using" I mean using the URIs
> to identify the language of a (sub part of an) RDF graph.
>
>
> This is frequently the case when working with under-resourced or historical
> languages. The granularity of BCP47 and/or ISO639 is simply insufficient,
> and the categories too imprecise, for many applications in linguistics. For
> low-resource languages and fine-grained language variety classification,
> Glottolog is relatively widely used. ISO639-6, which could have been applied
> here, has been withdrawn. For historical languages, there is nothing in
> existence (and the diachronic dimension in Glottolog is unsatisfactory). In
> multilingual datasets, people may decide to go for URI-based encoding
> throughout for the sake of consistency (many languages have no language
> tags). Another reason for using URIs is that these URIs are [or at least,
> can be] defined and verified, whereas some ISO639 labels can be interpreted
> differently (e.g., what is the difference between gmh and de?
> Traditionally, Middle High German extended from the 11th to the 15th c.;
> nowadays many people prefer 11th - 14th c., so, without any more detailed
> definition than provided by ISO639/BCP47, people will disagree on whether
> something is gmh or de).
>


I understand the "historical languages" aspect: some of the historical
languages are just not covered by the language subtags in the BCP 47 subtag
registry. I am not sure about the disagreement with regard to
ISO639/BCP47: key people from the ISO639 community have been involved in
the development of BCP47 and ensured that the subtag registry contains
"de", and that content that is German should just be tagged with "de".
That of course does not solve the issue with "gmh" versus "de".


>
> When using URIs, most people will probably prefer ISO639-3 because it is
> more established than Glottolog, and linguistically more fine-grained than
> ISO639-1 and ISO639-2 (and it doesn't need to follow the complex
> composition and selection rules of BCP47). The Library of Congress provides
> ISO639-1 and ISO639-2 URIs, but SIL (for ISO639-3) does not (AFAIK). This
> is why we normally go for Lexvo, although it's a bit behind ISO639-3.
>
> 2) Are there any recommendations like: "here use URIs, here use BCP 47"?
> From what I found, the main use case of URIs is to express information
> *about* languages as first-class objects, but not to attach language
> information to other parts of an RDF graph - see 1) above.
>
> 3) In addition to language, other types of metadata are needed in an
> i18n context, e.g. metadata about the directionality of strings. Do you
> know about best practices for representing such metadata in RDF?
>
> 4) In an "identify language via URIs" approach, how would one identify the
> entries of the BCP 47 subtag registry that do not have a URI?
>
>
> Provide URIs via the registry.
>


For subtags, that makes sense. I discussed this with the i18n folks at W3C,
and the issue is not the subtags but the language tags: these rely on a
generative mechanism (an ABNF in BCP47) that allows generating an
infinite number of language tags from subtags. Then there are
constraints like "de-1901 is OK, but en-1901 is not OK". It would be hard
to provide URIs for all combinations (useful or not) of the subtags.
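
To illustrate the kind of constraint meant here, a minimal sketch in Python
follows. It assumes a tiny hand-copied excerpt of the registry (only the
variants 1901 and 1996, both registered with the prefix "de"); the table and
the function names are made up for illustration, and the full ABNF is not
checked at all.

```python
# Hand-copied excerpt of variant "Prefix" fields; the real registry is far
# larger. This table is an assumption for illustration, not a complete copy.
VARIANT_PREFIXES = {
    "1901": ["de"],  # traditional German orthography
    "1996": ["de"],  # German orthography of 1996
}


def split_tag(tag):
    """Split a language tag into lowercase subtags (no ABNF validation)."""
    return tag.lower().split("-")


def variant_prefixes_match(tag):
    """Return True if every known variant in the tag follows a listed prefix."""
    subtags = split_tag(tag)
    for i, sub in enumerate(subtags[1:], start=1):
        prefixes = VARIANT_PREFIXES.get(sub)
        if prefixes is None:
            continue  # not a variant known to this excerpt
        head = "-".join(subtags[:i])  # everything before the variant
        if not any(head == p or head.startswith(p + "-") for p in prefixes):
            return False
    return True


print(variant_prefixes_match("de-1901"))
print(variant_prefixes_match("en-1901"))
```

Since the check has to look at the tag as a whole, one URI per subtag would
not capture it, which is the difficulty described above.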


>
> 5) Is there an authority for language related URIs?
>
>
> ISO 639-1: Library of Congress
> ISO 639-2: Library of Congress
> ISO 639-3: SIL (not URIs, though)
> Beyond that, everything is a matter of discussion and perspective, but
> Glottolog is a good starting point.
>
> NB: There was a discussion about revising language tags in an RDF context (
> https://github.com/w3c/EasierRDF/issues/22). My personal preference would
> be to permit URIs *as* language tags and to interpret every language tag as
> a URI in a specific namespace (unless another namespace is declared). This
> may be a little radical, but we could keep the current (Turtle/SPARQL)
> notation in this way.
>


Understood - yeah, anything that allows us to continue working with current
systems is great :)
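
A rough sketch of how the "language tag as URI in a default namespace" idea
could look, only to make the proposal concrete. The namespace
http://example.org/lang/ and the function name are hypothetical; nothing
like this is defined by RDF, Turtle or SPARQL today.

```python
# Hypothetical default namespace; an assumption, not part of any standard.
DEFAULT_LANG_NS = "http://example.org/lang/"


def lang_to_uri(lang, default_ns=DEFAULT_LANG_NS):
    """Interpret a language value as a URI.

    A value that already looks like an absolute URI is kept as-is; a plain
    tag such as "de" is resolved against the default namespace.
    """
    if "://" in lang:
        return lang
    return default_ns + lang.lower()


print(lang_to_uri("de"))
print(lang_to_uri("http://lexvo.org/id/iso639-3/gmh"))
```

Existing data with plain tags would keep working, and datasets that need
finer distinctions could use full URIs in the same slot.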

Thanks again for your feedback, very much appreciated - I will come back
with further feedback from the W3C i18n activity folks.

Cheers,

Felix


>
> Best,
> Christian
> --
> Prof. Dr. Christian Chiarcos
> Applied Computational Linguistics
> Johann Wolfgang Goethe Universität Frankfurt a. M.
> 60054 Frankfurt am Main, Germany
>
> office: Robert-Mayer-Str. 11-15, #107
> mail: chiarcos@informatik.uni-frankfurt.de
> web: http://acoli.cs.uni-frankfurt.de
> tel: +49-(0)69-798-22463
> fax: +49-(0)69-798-28334
>
Received on Tuesday, 7 April 2020 14:06:00 UTC