
RE: Clarification about language tag

From: Doug Ewell <doug@ewellic.org>
Date: Tue, 20 Dec 2016 11:32:02 -0700
Message-ID: <y2fm4cnyomtusmai7eatpb3a.1482258722522@email.android.com>
To: Misha.Wolf@thomsonreuters.com, chiarcos@informatik.uni-frankfurt.de, mvalle@cscs.ch
Cc: ietf-languages@iana.org, semantic-web@w3.org, christian.chiarcos@web.de
And if you are following BCP 47, you need to use the IANA Language Subtag Registry as the source of language subtags, and not worry about ISO 639-1 and 639-3 and the rest.
--
Doug Ewell | Thornton, CO, US | ewellic.org
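
To illustrate the point about using the registry as the source of subtags, here is a minimal Python sketch. It assumes the registry's record layout ("Field: value" lines, with records separated by "%%" lines) and runs on a tiny, abbreviated sample embedded in the code rather than on the real file. The names SAMPLE_REGISTRY, parse_records, and language_subtags are made up for this example, and folded continuation lines and repeated fields are deliberately not handled.

```python
# Abbreviated, illustrative sample in the registry's record style.
# The real registry has many more records and fields per record.
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: en
Description: English
%%
Type: language
Subtag: ang
Description: Old English (ca. 450-1100)
%%
Type: script
Subtag: Cyrl
Description: Cyrillic
"""


def parse_records(text):
    """Split registry text into a list of {field: value} dicts."""
    records = []
    current = {}
    for line in text.splitlines():
        if line.strip() == "%%":
            # A "%%" line closes the current record and starts a new one.
            if current:
                records.append(current)
            current = {}
            continue
        # Skip blank lines and folded continuation lines (simplification).
        if ":" not in line or line[:1].isspace():
            continue
        field, _, value = line.partition(":")
        # Simplification: a repeated field overwrites the earlier value.
        current[field.strip()] = value.strip()
    if current:
        records.append(current)
    return records


def language_subtags(text):
    """Return the set of subtags whose record has Type 'language'."""
    return {
        record["Subtag"]
        for record in parse_records(text)
        if record.get("Type") == "language" and "Subtag" in record
    }
```

With this sample, "en" and "ang" are found as language subtags, while "eng" is not, because the sample (like the registry) lists English under its two-letter subtag only.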
-------- Original message --------
From: Misha.Wolf@thomsonreuters.com
Date: 12/20/16  10:34  (GMT-07:00)
To: chiarcos@informatik.uni-frankfurt.de, mvalle@cscs.ch
Cc: ietf-languages@iana.org, semantic-web@w3.org, christian.chiarcos@web.de
Subject: RE: Clarification about language tag
+ ietf-languages@iana.org

Whether you work with modern-day languages or any other languages, you 
must follow BCP 47.

And, in following BCP 47, you must be prepared to use whatever language 
tag length is required, rather than assuming that all language tags have 
the same length.

Consider, for example, these longer language tags:
-  sr-Cyrl = Serbian (Cyrillic)
-  sr-Latn = Serbian (Latin)
-  uz-Cyrl = Uzbek (Cyrillic)
-  uz-Latn = Uzbek (Latin)
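
The point about variable-length tags can be sketched in Python. This is a minimal example that assumes only the "language plus optional script" shape of the tags listed above; it is not the full BCP 47 grammar (no region, variant, extension, or private-use subtags). The names TAG_RE and split_tag are made up for illustration.

```python
import re

# Simplified pattern: a 2-3 letter primary language subtag, optionally
# followed by a hyphen and a 4-letter script subtag.
TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})(?:-(?P<script>[A-Za-z]{4}))?$"
)


def split_tag(tag):
    """Return (language, script) for a simple tag, or None if it does not match.

    The script element is None when the tag has no script subtag.
    """
    match = TAG_RE.match(tag)
    if match is None:
        return None
    language = match.group("language").lower()
    script = match.group("script")
    # Conventional casing: language in lowercase, script in title case.
    return (language, script.title() if script else None)
```

For example, split_tag("sr-Cyrl") gives ("sr", "Cyrl") and split_tag("en") gives ("en", None), so code that stores or compares tags should not assume a fixed width.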


-----Original Message-----
From: Christian Chiarcos [mailto:chiarcos@informatik.uni-frankfurt.de] 
Sent: 20 December 2016 17:19
To: semantic-web@w3.org Web; Mario Valle
Cc: christian.chiarcos@web.de
Subject: Re: Clarification about language tag

Dear Mario,

> In Turtle syntax the @lang tag syntax refers to BCP47 that states:
> language      = 2*3ALPHA            ; shortest ISO 639 code
> That is, the language code (I ignore all the variants here) should be 2  
> or 3 characters.

This means you should use the two-letter code for a language that has one  
(@en), even if it also has a three-letter code (@eng). Not every language  
has a two-letter code.
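
A minimal Python sketch of that rule, for converting three-letter codes when generating Turtle literals. The mapping THREE_TO_TWO is a small hand-picked excerpt and the function names are made up for this example; a real conversion would need a complete table from an authoritative source. The escaping covers only backslashes and double quotes, so it assumes single-line strings.

```python
# Hypothetical excerpt: three-letter codes that also have a two-letter code.
THREE_TO_TWO = {"eng": "en", "deu": "de", "fra": "fr", "ita": "it"}


def shortest_code(code):
    """Return the two-letter code if the excerpt has one, else the input code."""
    code = code.lower()
    return THREE_TO_TWO.get(code, code)


def turtle_literal(text, code):
    """Format a language-tagged Turtle literal using the shortest code."""
    # Escape backslashes first, then double quotes.
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{shortest_code(code)}'
```

Here turtle_literal("Pancake", "eng") produces "Pancake"@en, while a language without a two-letter code, such as Old English (ang), keeps its three-letter code.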

> Indeed ISO 639 (http://www.loc.gov/standards/iso639-2/php/code_list.php)  
> lists both 2 and 3 chars codes (e.g., English: 'en' and 'eng').
> But in all Turtle examples I have found the language code has 2 chars.  
> Is it a requirement or is simply a tradition? This means, could I write  
> "Pancake"@eng?
> The question arises because WordNet contains 3 chars codes, so to  
> transform into triples, should/shouldn't I convert it to 2 characters?

The reason is that the 2-character codes are insufficient from the  
perspective of multilingual NLP or linguistics, where ISO 639-3 is much  
more established (and somewhat better defined) than the ISO 639-1 2-letter  
codes. Therefore, people developing language resources (like WordNet)  
sometimes neglect ISO 639-1 codes altogether. I also went that way  
at times. In terms of BCP 47, however, this is a mistake and should be  
fixed. As long as you work with modern-day major languages only and you  
don't see issues with the 2-letter codes for your task/resource, you  
should definitely follow BCP 47 and use 2-letter codes wherever possible.


> Thanks for your patience
> 				mario

Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: http://acoli.cs.uni-frankfurt.de
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931

Received on Tuesday, 20 December 2016 21:43:59 UTC
