Re: Question on language tags and directionality metadata from Christian Chiarcos on 2020-04-07 (public-ontolex@w3.org from April 2020)

From: Christian Chiarcos <christian.chiarcos@web.de>
Date: Tue, 07 Apr 2020 17:45:29 +0200
To: "Felix Sasaki" <felix@sasakiatcf.com>
Cc: public-ontolex@w3.org
Message-ID: <op.0ipnd3babr5td5@kitaba>
On .04.2020, 16:05, Felix Sasaki <felix@sasakiatcf.com> wrote:

> Thanks a lot, Christian, very helpful. Some comments below.

Sure ;) Some more inline :)

>
> On Mon, 6 Apr 2020 at 15:56, Christian Chiarcos  
> <chiarcos@informatik.uni-frankfurt.de> wrote:
>> Hi Felix,
>>
>> On .04.2020, 07:25, Felix Sasaki <felix@sasakiatcf.com> wrote:
>>
>>> I am still involved in W3C, in the internationalization activity. A
>>> question recently came up there on BCP 47, the IETF standard for
>>> language tags including the related sub tag registry, and RDF
>>> approaches to represent information about language.
>>>
>>> In RDF, of course you can use BCP 47 language tags for literals, but
>>> there are valuable resources like Lexvo that identify languages via
>>> URIs. Often these resources are based on ISO standards and have no
>>> direct relation to BCP 47. This also leads to fragmentation, for
>>> example since BCP 47 includes sub tags that are not part of a given
>>> ISO standard for languages or regions.
>>>
>>> In this context, I have a few questions:
>>>
>>> 1) Do you know of any best practices & use cases for using URIs (from
>>> Lexvo or other sources) in an RDF context? By "using" I mean using the
>>> URIs to identify the language of a (sub part of an) RDF graph.
>>
>> This is frequently the case when working with underresourced or
>> historical languages. The granularity of BCP47 and/or ISO639 is simply
>> insufficient and the categories too imprecise for many applications
>> in linguistics. For low-resource languages and fine-grained language
>> variety classification, Glottolog is relatively widely used. ISO639-6,
>> which could have been applied here, has been withdrawn. For historical
>> languages, nothing comparable exists (and the diachronic dimension in
>> Glottolog is unsatisfactory). In multilingual datasets, people may
>> decide to go for URI-based encoding throughout for the sake of
>> consistency (many languages have no language tags). Another reason for
>> using URIs is that these URIs are [or at least, can be] defined and
>> verified, whereas some ISO639 labels can be interpreted differently
>> (e.g., what is the difference between gmh and de? Traditionally,
>> Middle High German extended from the 11th to the 15th c.; nowadays
>> many people prefer 11th to 14th c., so, without any more detailed
>> definition than provided by ISO639/BCP47, people will disagree on
>> whether something is gmh or de).
>
>
> I understand the "historical languages" aspect: some of the historical
> languages are just not covered by the language subtags in the BCP 47
> sub tag registry. I am not sure about the disagreement with regards to
> ISO639/BCP47: Key people from the ISO639 community have been involved
> in the development of BCP47 and made sure that the sub tag registry
> contains "de", and that content that is German should be just tagged
> with "de". That of course does not solve the issue with "gmh" versus
> "de".

There is de. If gmh is not, it can (and should) be added in accordance  
with §3.5 BCP47. Simply because Middle High German really is a language  
different from Modern High German with a different orthography and  
different phonology. Not as far away as Italian from Latin, but something  
in this direction. But the definition aspect is crucial and ISO639 does  
not provide definitions, just labels, and neither does the subtag registry  
(which is synchronized with ISO639). I just checked. SIL included time  
information for gmh, but only *in the label* (ISO639-3 gmh is "Middle High  
German (ca. 1050-1500)"). This is not provided in a machine-readable
way, and it differs from how, for example, the Reference Corpus
Middle High German defines "Middle High German"
(https://www.linguistics.rub.de/rem/: 1050-1350).

>
>> When using URIs, most people will probably prefer ISO639-3 because it
>> is more established than Glottolog, and linguistically more
>> fine-grained than ISO639-1 and ISO639-2 (and it doesn't need to
>> follow the complex composition and selection rules of BCP47). The
>> Library of Congress provides ISO639-1 and ISO639-2 URIs, but SIL (for
>> ISO639-3) does not (AFAIK). This is why we normally go for Lexvo,
>> although it's a bit behind ISO639-3.
>>
>>> 2) Are there any recommendations like: "here use URIs, here use BCP
>>> 47"? From what I found, the main use case of URIs is to express
>>> information *about* languages as first-class objects, but not to
>>> attach language information to other parts of an RDF graph - see 1)
>>> above.
>>>
>>> 3) In addition to language, there are other types of metadata needed
>>> in an i18n context, e.g. metadata about directionality of strings. Do
>>> you know about best practices for representing such metadata in RDF?
>>>
>>> 4) In an "identify language via URIs" approach, how would one identify
>>> the entries of the BCP 47 sub tag registry that do not have a URI?
>>
>> Provide URIs via the registry.
>
>
> For subtags, that makes sense. I discussed this with the i18n folks at
> W3C, and the issue is not the sub tags but the language tags: these
> rely on a generative mechanism (an ABNF in BCP47) that allows
> generating an infinite number of language tags, based on sub tags.
> Then there are constraints like "de-1901 is OK, but en-1901 is not
> OK". It would be hard to provide URIs for all combinations of the sub
> tags (useful or not).

I agree, the matching and filtering rules defined by BCP47 could be hard  
to reproduce with URIs. The question is, however, whether a URI-based  
approach would have to implement the combination rules in the first place,  
or whether it would not be better to just represent each subtag with a  
separate URI. There already are types defined in  
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry,  
so that could be easily represented as a (simple) ontology.
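
As a rough illustration of the one-URI-per-subtag idea, here is a minimal Python sketch. It assumes the record layout of the registry file (records separated by "%%", with "Type", "Subtag", "Description" and "Prefix" fields) and abbreviates three records by hand. The namespace http://example.org/bcp47/ and the class and property names are made up for the example, since the registry itself does not publish URIs. Real records can carry several Description fields, continuation lines, and Prefix values that are longer tags; none of that is handled here.

```python
# Hypothetical namespace: the IANA registry does not publish URIs for subtags.
BASE = "http://example.org/bcp47/"

# Three records, abbreviated by hand, in the "%%"-separated layout of the registry.
SAMPLE = """\
Type: language
Subtag: de
Description: German
%%
Type: region
Subtag: CH
Description: Switzerland
%%
Type: variant
Subtag: 1901
Description: Traditional German orthography
Prefix: de
"""


def parse_records(text):
    """Split the text on '%%' and return one dict of fields per record."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def subtag_uri(record):
    """Mint one URI per subtag, using the record's Type as a path segment."""
    return f"{BASE}{record['Type']}/{record['Subtag']}"


def to_triples(records):
    """Yield (subject, predicate, object) tuples; all names are placeholders."""
    for rec in records:
        uri = subtag_uri(rec)
        yield (uri, "rdf:type", f"{BASE}{rec['Type'].capitalize()}Subtag")
        yield (uri, "rdfs:label", rec["Description"])
        if "Prefix" in rec:
            # Assumes the prefix is a bare language subtag, as in the sample.
            yield (uri, f"{BASE}prefix", f"{BASE}language/{rec['Prefix']}")


for triple in to_triples(parse_records(SAMPLE)):
    print(triple)
```

The point of the sketch is only that the Type field already gives a class per subtag, so no combination rules are needed to assign the URIs.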

To describe the combination of subtags in an explicit and machine-readable  
way (rather than by packing it in a rather opaque language tag) could also  
help, for example, to eliminate a certain ambiguity that BCP47 (in my  
view) introduces: Regarding the combination of language tags and region
codes, region codes sometimes seem to be abused to define dialects, and
sometimes just mark the region according to whose standards (or where?)
something is written. I assume that only the latter is the intended
function. Conflating those aspects is problematic if a region features  
both a regional orthography of a standard language and a (different)  
dialect of the same language. This is the situation for Switzerland, where  
Swiss German "ussen" (outside) corresponds to Standard German "außen",
which is written "aussen" in Swiss orthography.

For this particular case, this is solved by giving Swiss German a separate  
language tag (gsw), so that we have "ussen"@gsw-CH, "außen"@de-DE,  
"aussen"@de-CH, but coverage of ISO639 (and by extension, BCP47) for  
dialects outside central Europe is *much* more limited. I think there may  
be an issue of this kind with Åland Swedish, which is a Swedish dialect
spoken in Finland, but different from Finnish Swedish (according to
https://en.wikipedia.org/wiki/%C3%85land_Swedish), and it does not
seem to have a separate language tag. If explicit properties are used to define
the region of a particular language, then dialectal and orthographical  
variation can be more clearly distinguished by using different properties,  
say xyz:spokenIn (for dialectal variants) vs. xyz:writtenIn (for  
orthographical variants) [there may be better names for those properties  
...].
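
To make the distinction concrete, here is a small Python sketch that records the role of the region explicitly. The namespace behind xyz: and the xyz:language property are placeholders invented for the example, and describing "ussen" as language "de" plus a spokenIn region is for illustration only (in practice gsw exists for this case).

```python
# Placeholder namespace for the xyz: prefix used above; not a real vocabulary.
XYZ = "http://example.org/xyz#"


def describe(form, language, spoken_in=None, written_in=None):
    """Return (form, property, value) triples with an explicit role for the region."""
    triples = [(form, XYZ + "language", language)]
    if spoken_in:
        # Region identifies a dialect spoken there.
        triples.append((form, XYZ + "spokenIn", spoken_in))
    if written_in:
        # Region identifies the orthographic standard that is followed.
        triples.append((form, XYZ + "writtenIn", written_in))
    return triples


dialect = describe("ussen", "de", spoken_in="CH")
swiss_orthography = describe("aussen", "de", written_in="CH")
standard = describe("außen", "de", written_in="DE")

for triples in (dialect, swiss_orthography, standard):
    for triple in triples:
        print(triple)
```

Both "ussen" and "aussen" would collapse to de-CH in an opaque tag if gsw were not available, whereas the two descriptions above stay distinct.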

BTW: How is language tag validation performed at the moment? If this is a  
dynamic process anyway, then, instead of providing a full list of URIs for  
all possible combinations, the registry could just feature a service that
checks whether a combination is ok (returning some structured
representation) and returns an error otherwise.
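
A minimal Python sketch of such a check, under several assumptions: the registry is a tiny hand-written stand-in, only the language[-script][-region][-variant...] part of the BCP47 grammar is handled, and a variant whose Prefix does not match the language is rejected. The last choice follows the "en-1901" example from the thread; as far as I read BCP47, Prefix is a recommendation rather than a strict validity condition, so a real service might report a warning instead. The function name check_tag is made up for the example.

```python
import re

# Tiny stand-in for the subtag registry; a real service would load the full file.
REGISTRY = {
    "language": {"de", "en", "gsw"},
    "script": {"Latn"},
    "region": {"DE", "CH", "US"},
    # variant -> language subtags listed as Prefix (simplified)
    "variant": {"1901": ["de"], "1996": ["de"]},
}


def check_tag(tag):
    """Return a structured description of `tag`, or raise ValueError."""
    parts = tag.split("-")
    result = {"language": parts[0].lower(), "script": None,
              "region": None, "variants": []}
    if result["language"] not in REGISTRY["language"]:
        raise ValueError(f"unknown language subtag: {parts[0]}")
    rest = parts[1:]
    # Optional script: four letters.
    if rest and re.fullmatch(r"[A-Za-z]{4}", rest[0]):
        result["script"] = rest.pop(0).title()
        if result["script"] not in REGISTRY["script"]:
            raise ValueError(f"unknown script subtag: {result['script']}")
    # Optional region: two letters or three digits.
    if rest and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", rest[0]):
        result["region"] = rest.pop(0).upper()
        if result["region"] not in REGISTRY["region"]:
            raise ValueError(f"unknown region subtag: {result['region']}")
    # Everything left is treated as a variant and checked against its prefixes.
    for variant in rest:
        name = variant.lower()
        prefixes = REGISTRY["variant"].get(name)
        if prefixes is None:
            raise ValueError(f"unknown variant subtag: {variant}")
        if result["language"] not in prefixes:
            raise ValueError(f"variant {name} is not registered for {result['language']}")
        result["variants"].append(name)
    return result


print(check_tag("de-CH-1901"))
try:
    check_tag("en-1901")
except ValueError as err:
    print("rejected:", err)
```

With a check like this, URIs would only be needed for the subtags themselves, and the structured result could be turned into explicit statements on demand.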

>
>>
>>> 5) Is there an authority for language related URIs?
>>
>> ISO 639-1: Library of Congress
>> ISO 639-2: Library of Congress
>> ISO 639-3: SIL (not URIs, though)
>> Beyond that, everything is a matter of discussion and perspective, but  
>> Glottolog is a good starting point.
>>
>> NB: There was a discussion about revising language tags in an RDF
>> context (https://github.com/w3c/EasierRDF/issues/22). My personal
>> preference would be to permit URIs *as* language tags and to
>> interpret every language tag as a URI in a specific namespace (unless
>> another namespace is declared). This may be a little radical, but we
>> could keep the current (Turtle/SPARQL) notation in this way.
>
>
> Understood - yeah, anything that allows us to continue working with
> current systems is great :)
>
> Thanks again for your feedback, very much appreciated - I will come back  
> with further feedback from w3c i18n activity folks again.

Great, looking forward to it ;)

Best,
Christian


>
> Cheers,
>
> Felix
>
>>
>> Best,
>> Christian
>> --
>> Prof. Dr. Christian Chiarcos
>> Applied Computational Linguistics
>> Johann Wolfgang Goethe Universität Frankfurt a. M.
>> 60054 Frankfurt am Main, Germany
>>
>> office: Robert-Mayer-Str. 11-15, #107
>> mail: chiarcos@informatik.uni-frankfurt.de
>> web: http://acoli.cs.uni-frankfurt.de
>> tel: +49-(0)69-798-22463
>> fax: +49-(0)69-798-28334
Received on Tuesday, 7 April 2020 15:45:48 UTC