
Re: Question: replacing language codes in a SPARQL BIND statement?

From: Felix Sasaki <fsasaki@w3.org>
Date: Thu, 17 Mar 2016 14:38:14 +0100
Cc: public-ontolex@w3.org, "A list for those interested in open data in linguistics." <open-linguistics@lists.okfn.org>
Message-Id: <FA7639A2-EF15-43FD-A6B3-C223D44C8B5A@w3.org>
To: Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de>

> On 17.03.2016, at 12:20, Christian Chiarcos <chiarcos@informatik.uni-frankfurt.de> wrote:
> 
> Hi Felix,
> 
> thanks for correcting me, I was oversimplifying with a hypothetical example, and wrongly, actually. In fact, BCP 47 states that 
> 
>  "When languages have both an ISO 639-1 two-character code and a three-
> character code (assigned by ISO 639-2, ISO 639-3, or ISO 639-5), only
> the ISO 639-1 two-character code is defined in the IANA registry."
> 
> xml:lang allows only BCP 47 language tags, and here the options you describe (e.g. ISO 639-3 vs. ISO 639-2) are not available. So if you use a language tag validator, you can at least detect that an xml:lang value is not valid.
> 
> The conversion issue, however, remains with BCP 47, as soon as extended language subtags are involved:
> 
> "Extended language subtags are used to identify certain specially selected languages that, for various historical and compatibility reasons, are closely identified with or tagged using an existing primary language subtag. Extended language subtags are always used with their enclosing primary language subtag (indicated with a 'Prefix' field in the registry) when used to form the language tag. ...
> For example, the macrolanguage Chinese ('zh') encompasses a number of languages. For compatibility reasons, each of these languages has both a primary and extended language subtag in the registry. A few selected examples of these include Gan Chinese ('gan'), Cantonese Chinese ('yue'), and Mandarin Chinese ('cmn'). Each is encompassed by the macrolanguage 'zh' (Chinese). Therefore, they each have the prefix "zh" in their registry records. Thus, Gan Chinese is represented with tags beginning "zh-gan" or "gan", Cantonese with tags beginning either "yue" or "zh-yue", and Mandarin Chinese with "zh-cmn" or "cmn"." 
> 
> Quotes from http://www.rfc-editor.org/rfc/bcp/bcp47.txt (or https://tools.ietf.org/html/rfc5646).
> 
> https://validator.w3.org/#validate_by_input
> 
> The validator actually complains about "zh-gan": "Potentially bad value zh-gan for attribute lang on element html <http://www.w3.org/html/wg/drafts/html/master/single-page.html#the-html-element>: The language tag zh-gan is deprecated. Use gan instead." (This might be incorrect as it refers to the very text from which I got it as recommendation, see above.)


Yes, that is correct. zh-gan falls into the category of redundant tags:
https://tools.ietf.org/html/bcp47#section-2.2.8
In the registry at
http://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
you will find an entry for zh-gan:
%%
Type: redundant
Tag: zh-gan
Description: Kan or Gan
Added: 1999-12-18
Deprecated: 2009-07-29
Preferred-Value: gan
%%

The validator.nu library processes that information and produces the warning. This is not an error; the language tag is still valid.
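
As a rough illustration of that kind of check (a sketch only, not validator.nu's actual code), the record above can be parsed and turned into a warning along these lines. The names RECORD, parse_record and deprecation_warning are made up for this example, and the lookup only handles whole-tag records (those with a "Tag" field), not individual subtags:

```python
# Hypothetical sketch: read one record in the IANA registry's
# "Field: value" format and warn if the tag is deprecated.

RECORD = """\
Type: redundant
Tag: zh-gan
Description: Kan or Gan
Added: 1999-12-18
Deprecated: 2009-07-29
Preferred-Value: gan
"""


def parse_record(text):
    """Parse one registry record into a dict mapping field name -> list of values."""
    fields = {}
    last_key = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and last_key is not None:
            # Folded continuation line: append to the previous field value.
            fields[last_key][-1] += " " + line.strip()
            continue
        name, _, body = line.partition(":")
        last_key = name.strip()
        # Some fields (e.g. Description) may repeat, so keep a list.
        fields.setdefault(last_key, []).append(body.strip())
    return fields


def deprecation_warning(tag, records):
    """Return a warning string if `tag` matches a deprecated record, else None."""
    for rec in records:
        if rec.get("Tag", [""])[0].lower() != tag.lower():
            continue
        if "Deprecated" not in rec:
            return None
        message = f"The language tag {tag} is deprecated."
        preferred = rec.get("Preferred-Value", [None])[0]
        if preferred:
            message += f" Use {preferred} instead."
        return message
    return None


if __name__ == "__main__":
    records = [parse_record(RECORD)]
    print(deprecation_warning("zh-gan", records))
```

The function returns a warning rather than raising an error, mirroring the point above: a deprecated tag is still a valid tag.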

Best,

Felix 

> 
> But anyway, there is a 3-letter-to-2-letter conversion required, if we want to treat lexical forms from sub-varieties of Chinese (like gan) like "ordinary" zh.
> 
> But the underlying library
> https://about.validator.nu/
> has a class to validate language tags on its own.
> 
> That will certainly help. Thanks to all for responding, I have a much clearer picture of language tags now.
> 
> Thanks a lot,
> Christian
> -- 
> Prof. Dr. Christian Chiarcos
> Applied Computational Linguistics
> Johann Wolfgang Goethe Universität Frankfurt a. M.
> 60054 Frankfurt am Main, Germany
> 
> office: Robert-Mayer-Str. 10, #401b
> mail: chiarcos@informatik.uni-frankfurt.de
> web: http://acoli.cs.uni-frankfurt.de
> tel: +49-(0)69-798-22463
> fax: +49-(0)69-798-28931
Received on Thursday, 17 March 2016 13:38:31 UTC
