Re: Clarification about language tag

From: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de> · Date: Tue, 20 Dec 2016 23:20:52 +0100

On .12.2016, 19:32, Doug Ewell <doug@ewellic.org> wrote:

> And if you are following BCP 47, you need to use the IANA Language  
> Subtag Registry as the source of language subtags, and not worry about  
> ISO 639-1 and 639-3 and the rest.

+1

Deviating from this is tag abuse and bad practice, though it does  
sometimes happen, and normally for a reason.
However, if existing language tags really aren't sufficient, it is better  
*not* to use a (potentially misleading or ill-defined) language tag, but  
to represent language variety information of a resource abc:xyz with an  
explicit link to a repository such as Glottolog:

abc:xyz dcterms:language  
<http://glottolog.org/resource/languoid/id/stan1293>.
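As a rough sketch of how a conversion script might apply both rules, the Python below first looks for the shortest registered subtag and falls back to a Glottolog link only when none is known. Everything named here is an assumption for illustration: the three-entry table is a hand-written excerpt standing in for the IANA Language Subtag Registry, the function names are made up, and the Glottocode pattern (four alphanumeric characters plus four digits) is my reading of the format, not a normative definition.

```python
import re

# Hypothetical excerpt: a real script would load the full IANA Language
# Subtag Registry rather than rely on this hand-written table.
ISO639_3_TO_SHORTEST = {
    "eng": "en",   # English has a two-letter subtag
    "deu": "de",   # German has a two-letter subtag
    "gsw": "gsw",  # Swiss German has no two-letter code, so three letters stay
}

# Base URI taken from the Glottolog link in the example above.
GLOTTOLOG_BASE = "http://glottolog.org/resource/languoid/id/"

# Assumed Glottocode shape: four alphanumeric characters, then four digits.
GLOTTOCODE_RE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def shortest_subtag(code):
    """Return the shortest known subtag for a code, or None if unknown."""
    return ISO639_3_TO_SHORTEST.get(code.lower())


def tagged_literal(text, code):
    """Return a Turtle literal with a language tag, or None without a subtag."""
    subtag = shortest_subtag(code)
    if subtag is None:
        return None
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{subtag}'


def glottolog_statement(subject, glottocode):
    """Return a Turtle triple linking a resource to a Glottolog languoid."""
    if not GLOTTOCODE_RE.fullmatch(glottocode):
        raise ValueError(f"not a plausible Glottocode: {glottocode!r}")
    return f"{subject} dcterms:language <{GLOTTOLOG_BASE}{glottocode}> ."


if __name__ == "__main__":
    print(tagged_literal("Pancake", "eng"))
    print(glottolog_statement("abc:xyz", "stan1293"))
```

The point of returning None instead of guessing a tag is that the caller is forced to decide explicitly between dropping the language tag and adding the explicit dcterms:language link.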

Christian

>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
> -------- Original message --------
> From: Misha.Wolf@thomsonreuters.com
> Date: 12/20/16 10:34 (GMT-07:00)
> To: chiarcos@informatik.uni-frankfurt.de, mvalle@cscs.ch
> Cc: ietf-languages@iana.org, semantic-web@w3.org, christian.chiarcos@web.de
> Subject: RE: Clarification about language tag
> + ietf-languages@iana.org
>
> Whether you work with modern-day languages or any other languages, you 
> must follow BCP47.
>
> And, in following BCP-47, you must be prepared to use whatever language 
> tag length is required, not assume that all language tags will have the 
> same length.
>
> Consider, for example, these longer language tags:
> -  sr-Cyrl = Serbian (Cyrillic)
> -  sr-Latn = Serbian (Latin)
> -  uz-Cyrl = Uzbek (Cyrillic)
> -  uz-Latn = Uzbek (Latin)
>
> Regards,
> Misha
>
>
> -----Original Message-----
> From: Christian Chiarcos [mailto:chiarcos@informatik.uni-frankfurt.de] 
> Sent: 20 December 2016 17:19
> To: semantic-web@w3.org Web; Mario Valle
> Cc: christian.chiarcos@web.de
> Subject: Re: Clarification about language tag
>
> Dear Mario,
>
>> In Turtle syntax the @lang tag syntax refers to BCP47 that states:
>>
>> language      = 2*3ALPHA            ; shortest ISO 639 code
>>
>> That is, the language code (I ignore all the variants here) should be 2  
>> or 3 characters.
>
> This means you should use the two-letter code for a language that has  
> one (@en) even if it also has a three-letter code (@eng). Not every  
> language has a two-letter code.
>
>> Indeed ISO 639  
>> (http://www.loc.gov/standards/iso639-2/php/code_list.php)  
>> lists both 2 and 3 chars codes (e.g., English: 'en' and 'eng').
>>
>> But in all Turtle examples I have found the language code has 2 chars.  
>> Is it a requirement or is simply a tradition? This means, could I write  
>> "Pancake"@eng?
>>
>> The question arises because WordNet contains 3 chars codes, so to  
>> transform into triples, should/shouldn't I convert it to 2 characters?
>
> The reason is that the 2-character codes are insufficient from the  
> perspective of multilingual NLP or linguistics where ISO 639-3 is much  
> more established (and somewhat better defined) than ISO 639-1 2-letter  
> codes. Therefore, people developing language resources (like WordNet)  
> sometimes tend to neglect ISO 639-1 codes altogether. I also went that  
> way at times. In terms of BCP47, however, this is a mistake and should  
> be fixed. As long as you work with modern-day major languages only and  
> you don't see issues with the 2-letter codes for your task/resource, you  
> should definitely follow BCP47 and use 2-letter codes wherever possible.
>
> Best,
> Christian
>
>>
>> Thanks for your patience
>>
>> mario
>>
>
>
> -- 
> Prof. Dr. Christian Chiarcos
> Applied Computational Linguistics
> Johann Wolfgang Goethe Universität Frankfurt a. M.
> 60054 Frankfurt am Main, Germany
>
> office: Robert-Mayer-Str. 10, #401b
> mail: chiarcos@informatik.uni-frankfurt.de
> web: http://acoli.cs.uni-frankfurt.de
> tel: +49-(0)69-798-22463
> fax: +49-(0)69-798-28931

-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931