- From: <Misha.Wolf@thomsonreuters.com>
- Date: Tue, 20 Dec 2016 17:34:32 +0000
- To: <chiarcos@informatik.uni-frankfurt.de>, <mvalle@cscs.ch>
- CC: <christian.chiarcos@web.de>, <semantic-web@w3.org>, <ietf-languages@iana.org>
+ ietf-languages@iana.org

Whether you work with modern-day languages or any other languages, you must follow BCP47. And, in following BCP-47, you must be prepared to use whatever language tag length is required, not assume that all language tags will have the same length. Consider, for example, these longer language tags:

- sr-Cyrl = Serbian (Cyrillic)
- sr-Latn = Serbian (Latin)
- uz-Cyrl = Uzbek (Cyrillic)
- uz-Latn = Uzbek (Latin)

Regards,
Misha

-----Original Message-----
From: Christian Chiarcos [mailto:chiarcos@informatik.uni-frankfurt.de]
Sent: 20 December 2016 17:19
To: semantic-web@w3.org Web; Mario Valle
Cc: christian.chiarcos@web.de
Subject: Re: Clarification about language tag

Dear Mario,

> In Turtle syntax the @lang tag syntax refers to BCP47 that states:
>
> language = 2*3ALPHA ; shortest ISO 639 code
>
> That is, the language code (I ignore all the variants here) should be 2
> or 3 characters.

This means you should use the two-letter code for a language that has one (@en) even if it does have a three-letter code (@eng). Not every language does have a two-letter code.

> Indeed ISO 639 (https://urldefense.proofpoint.com/v2/url?u=http-3A__www.loc.gov_standards_iso639-2D2_php_code-5Flist.php&d=CwIFbA&c=4ZIZThykDLcoWk-GVjSLm9hvvvzvGv0FLoWSRuCSs5Q&r=VsO6ShdzLK20Tv5zCK2CUVP_oB340q3grZz3gJtouLE&m=SnuMGpH9aJBpJG4G8i3x6v1GSkDNibOUkGqj0zyTv_o&s=TEzQYtkmHF-FAqtk-AbmPZIVKuLy0UGpXXfHOfCIwQ0&e= )
> lists both 2 and 3 chars codes (e.g., English: 'en' and 'eng').
>
> But in all Turtle examples I have found the language code has 2 chars.
> Is it a requirement or is simply a tradition? This means, could I write
> "Pancake"@eng?
>
> The question arises because WordNet contains 3 chars codes, so to
> transform into triples, should/shouldn't I convert it to 2 characters?

The reason is that the 2-character codes are insufficient from the perspective of multilingual NLP or linguistics where ISO 639-3 is much more established (and somewhat better defined) than ISO 639-1 2-letter codes. Therefore, people developing language resources (like WordNet) sometimes tend to neglect ISO 639-1 codes altogether. I also went that way at times. In terms of BCP47, however, this is a mistake and should be fixed.

As long as you work with modern-day major languages only and you don't see issues with the 2-letter codes for your task/resource, you should definitely follow BCP47 and use 2-letter codes wherever possible.

Best,
Christian

>
> Thanks for your patience
>
> mario
>

--
Prof. Dr. Christian Chiarcos
Applied Computational Linguistics
Johann Wolfgang Goethe Universität Frankfurt a. M.
60054 Frankfurt am Main, Germany

office: Robert-Mayer-Str. 10, #401b
mail: chiarcos@informatik.uni-frankfurt.de
web: https://urldefense.proofpoint.com/v2/url?u=http-3A__acoli.cs.uni-2Dfrankfurt.de&d=CwIFbA&c=4ZIZThykDLcoWk-GVjSLm9hvvvzvGv0FLoWSRuCSs5Q&r=VsO6ShdzLK20Tv5zCK2CUVP_oB340q3grZz3gJtouLE&m=SnuMGpH9aJBpJG4G8i3x6v1GSkDNibOUkGqj0zyTv_o&s=SYYlim1HJWSJMzRcHsHxPJTJurnKt2vFAm48s952MLA&e=
tel: +49-(0)69-798-22463
fax: +49-(0)69-798-28931
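
For illustration only, the conversion discussed in this thread (three-letter codes from a resource such as WordNet turned into the shortest available code before writing Turtle literals) might be sketched as below. The mapping table is a tiny hand-picked subset and the function names are hypothetical; a real conversion would presumably draw the full mapping from the IANA Language Subtag Registry. Any subtags after the first (such as a script) are passed through unchanged, so tags of different lengths are handled.

```python
# Illustrative subset only (an assumption for this sketch); a complete table
# would come from the IANA Language Subtag Registry.
ISO639_3_TO_1 = {
    "eng": "en",
    "deu": "de",
    "fra": "fr",
    "srp": "sr",
    "uzb": "uz",
}


def shortest_language_tag(tag: str) -> str:
    """Replace a three-letter primary subtag with its two-letter code, if one is known.

    Subtags after the first (script, region, ...) are kept as given.
    """
    primary, *rest = tag.split("-")
    primary = primary.lower()
    # Fall back to the original code when no two-letter code is in the table.
    primary = ISO639_3_TO_1.get(primary, primary)
    return "-".join([primary, *rest])


def turtle_literal(text: str, tag: str) -> str:
    """Format a language-tagged Turtle string literal (hypothetical helper)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{shortest_language_tag(tag)}'


if __name__ == "__main__":
    print(turtle_literal("Pancake", "eng"))
    print(shortest_language_tag("srp-Cyrl"))
    print(shortest_language_tag("grc"))
```

A language with no two-letter code, such as Ancient Greek (grc), keeps its three-letter code, which matches the "shortest ISO 639 code" rule quoted above.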
Received on Tuesday, 20 December 2016 17:35:26 UTC