RE: Clarification about language tag

And if you are following BCP 47, you need to use the IANA Language Subtag Registry as the source of language subtags, and not worry about ISO 639-1 and 639-3 and the rest.
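As a rough illustration of checking subtags against the registry rather than against the ISO 639 lists, here is a minimal sketch. The embedded SAMPLE_REGISTRY is a heavily abbreviated, illustrative excerpt in the registry's record format (records separated by "%%"), and the helper names parse_registry and language_subtags are hypothetical. The real registry also has folded continuation lines and repeated fields, which this sketch ignores.

```python
# Abbreviated, illustrative excerpt in the registry's record format.
# The real file is much longer and carries more fields per record.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: en
Description: English
Suppress-Script: Latn
%%
Type: language
Subtag: sr
Description: Serbian
%%
Type: language
Subtag: ang
Description: Old English (ca. 450-1100)
%%
Type: script
Subtag: Cyrl
Description: Cyrillic
"""


def parse_registry(text):
    """Split the text on '%%' and turn each record into a dict of fields."""
    records = []
    for chunk in text.split("%%"):
        record = {}
        for line in chunk.strip().splitlines():
            key, sep, value = line.partition(":")
            if sep:
                # A repeated field overwrites the earlier one in this sketch.
                record[key.strip()] = value.strip()
        if "Type" in record:
            records.append(record)
    return records


def language_subtags(records):
    """Collect the subtags of all records whose Type is 'language'."""
    return {
        r["Subtag"]
        for r in records
        if r["Type"] == "language" and "Subtag" in r
    }


langs = language_subtags(parse_registry(SAMPLE_REGISTRY))
print("en" in langs)   # English is registered under its two-letter subtag
print("eng" in langs)  # the three-letter code is not a registered subtag
print("ang" in langs)  # a language with only a three-letter subtag
```

In this excerpt, as in the registry itself, English appears only as "en", so a tag such as "eng" would not validate.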
--Doug Ewell | Thornton, CO, US | ewellic.org
-------- Original message --------
From: Misha.Wolf@thomsonreuters.com
Date: 12/20/16  10:34  (GMT-07:00)
To: chiarcos@informatik.uni-frankfurt.de, mvalle@cscs.ch
Cc: ietf-languages@iana.org, semantic-web@w3.org, christian.chiarcos@web.de
Subject: RE: Clarification about language tag
+ ietf-languages@iana.org

Whether you work with modern-day languages or any other languages, you 
must follow BCP 47.

And, in following BCP 47, you must be prepared to use language tags of 
whatever length is required, rather than assuming that all language tags 
have the same length.

Consider, for example, these longer language tags:
-  sr-Cyrl = Serbian (Cyrillic)
-  sr-Latn = Serbian (Latin)
-  uz-Cyrl = Uzbek (Cyrillic)
-  uz-Latn = Uzbek (Latin)
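
To make the point about variable tag length concrete, here is a minimal sketch that splits a tag into its subtags instead of assuming a fixed width. The pattern is a deliberately simplified assumption (language, optional script, optional region) and not the full BCP 47 grammar; the name split_tag is hypothetical.

```python
import re

# Simplified assumption, not the full BCP 47 grammar:
# 2-3 letter language, optional 4-letter script,
# optional region (2 letters or 3 digits).
TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
)


def split_tag(tag):
    """Return the subtags present in `tag`, keyed by their role."""
    match = TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError(f"not handled by this simplified pattern: {tag!r}")
    return {k: v for k, v in match.groupdict().items() if v is not None}


for tag in ["en", "sr-Cyrl", "sr-Latn", "uz-Cyrl", "uz-Latn"]:
    print(tag, split_tag(tag))
```

Code that stores or compares tags as fixed two-character strings would truncate every tag in the list above.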

Regards,
Misha


-----Original Message-----
From: Christian Chiarcos [mailto:chiarcos@informatik.uni-frankfurt.de] 
Sent: 20 December 2016 17:19
To: semantic-web@w3.org Web; Mario Valle
Cc: christian.chiarcos@web.de
Subject: Re: Clarification about language tag

Dear Mario,

> In Turtle syntax the @lang tag syntax refers to BCP47 that states:
>
> language      = 2*3ALPHA            ; shortest ISO 639 code
>
> That is, the language code (I ignore all the variants here) should be 2  
> or 3 characters.

This means you should use the two-letter code for a language that has one  
(@en), even if it also has a three-letter code (@eng). Not every language  
has a two-letter code.
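
For the WordNet conversion, the "shortest code" rule could be sketched as follows. The THREE_TO_TWO table is a tiny hand-written assumption for illustration only; a real conversion should derive the mapping from the registry or the ISO 639 tables. The names shortest_code and turtle_literal are hypothetical.

```python
# Tiny illustrative mapping (assumption): three-letter code -> two-letter
# code, listed only for languages that have a two-letter code.
THREE_TO_TWO = {
    "eng": "en",
    "deu": "de",
    "fra": "fr",
    "srp": "sr",
    "uzb": "uz",
}


def shortest_code(code):
    """Prefer the two-letter code when one exists; otherwise keep the input."""
    code = code.lower()
    return THREE_TO_TWO.get(code, code)


def turtle_literal(text, code):
    """Format a Turtle language-tagged literal using the shortest code."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{shortest_code(code)}'


print(turtle_literal("Pancake", "eng"))  # "Pancake"@en
print(turtle_literal("cyning", "ang"))   # "cyning"@ang (no two-letter code)
```

Codes missing from the table pass through unchanged, which is the intended behaviour for languages without a two-letter code but would also hide gaps in an incomplete table.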

> Indeed ISO 639 (http://www.loc.gov/standards/iso639-2/php/code_list.php)  
> lists both 2- and 3-character codes (e.g., English: 'en' and 'eng').
>
> But in all the Turtle examples I have found, the language code has 2  
> characters. Is this a requirement or simply a tradition? That is, could  
> I write "Pancake"@eng?
>
> The question arises because WordNet contains 3-character codes, so when  
> transforming it into triples, should I convert them to 2 characters?

WordNet uses 3-character codes because the 2-character codes are  
insufficient from the perspective of multilingual NLP or linguistics,  
where ISO 639-3 is much more established (and somewhat better defined)  
than the ISO 639-1 2-letter codes. Therefore, people developing language  
resources (like WordNet) sometimes neglect ISO 639-1 codes altogether. I  
also went that way at times. In terms of BCP47, however, this is a mistake  
and should be fixed. As long as you work with modern-day major languages  
only and you don't see issues with the 2-letter codes for your  
task/resource, you should definitely follow BCP47 and use 2-letter codes  
wherever possible.

Best,
Christian

>
> Thanks for your patience
>
>     mario
>


-- 
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931

_______________________________________________
Ietf-languages mailing list
Ietf-languages@alvestrand.no
http://www.alvestrand.no/mailman/listinfo/ietf-languages

Received on Tuesday, 20 December 2016 21:43:59 UTC