Re: telco this Friday from Felix Sasaki on 2014-01-30 (public-ontolex@w3.org from January 2014)

From: Felix Sasaki <fsasaki@w3.org>
Date: Thu, 30 Jan 2014 09:46:07 +0100
To: Gil Francopoulo <gil.francopoulo@wanadoo.fr>, public-ontolex@w3.org
Message-ID: <52EA114F.4020006@w3.org>
On 30.01.14 09:37, Gil Francopoulo wrote:
> On 30/01/2014 09:18, Felix Sasaki wrote:
>> Hi Gil, all,
>>
>> On 30.01.14 09:12, Gil Francopoulo wrote:
>>> Dear Philip and Lars,
>>>
>>> I agree with Lars.
>>>
>>> I suggest taking a look at (and following) IETF BCP 47 in the examples
>>
>> +1.
>>
>>> , where:
>>>
>>> * a language code is never in upper-case but in lower-case,
>>
>> Both would be fine according to BCP 47 - it is case-insensitive.
>>
>>> * a country code is always in upper-case and respects ISO-3166-1
>>
>> see above.
>
> ok, but in the ISO lists, language codes are always lower-case and 
> country codes are always upper-case.

Sure, I was just mentioning what is likely to be checked by a BCP 47 
validator (I assume a lemon implementation would use existing code, see 
e.g. http://www.langtag.net/), since what you cite below says "...is 
RECOMMENDED" (= implementers are free to follow the recommendation or not).
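
A minimal sketch of the distinction, assuming only the casing convention 
described in section 2.1.1 of BCP 47 (RFC 5646): comparison ignores case, 
while the recommended presentation lowercases every subtag except two-letter 
subtags (uppercase) and four-letter subtags (titlecase) that are neither 
first nor after a singleton. The function names are hypothetical, and the 
code does not check that a tag is well-formed or registered.

```python
def tags_equal(a: str, b: str) -> bool:
    """Compare two language tags the way BCP 47 does: ignoring case."""
    return a.lower() == b.lower()


def canonical_case(tag: str) -> str:
    """Apply the RECOMMENDED casing to a language tag.

    This only changes letter case; it does not validate the tag.
    """
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # First subtag, or anything after a singleton such as "x".
            result.append(subtag.lower())
        elif len(subtag) == 2:
            # Two-letter subtag in a later position: region code.
            result.append(subtag.upper())
        elif len(subtag) == 4:
            # Four-letter subtag in a later position: script code.
            result.append(subtag.capitalize())
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(result)
```

With this, "EN-us" and "en-US" compare equal, and canonical_case() only 
affects how a tag is displayed, not whether a validator accepts it.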

- Felix

>
> And in http://tools.ietf.org/search/bcp47, section 2.1.1
>
>     The ABNF syntax also does not distinguish between upper- and
>     lowercase: the uppercase US-ASCII letters in the range 'A' through
>     'Z' are always considered equivalent and mapped directly to their US-
>     ASCII lowercase equivalents in the range 'a' through 'z'.  So the tag
>     "I-AMI" is considered equivalent to that value "i-ami" in the
>     'irregular' production.
>
>     Although case distinctions do not carry meaning in language tags,
>     consistent formatting and presentation of language tags will aid
>     users.  The format of subtags in the registry is RECOMMENDED as the
>     form to use in language tags.  This format generally corresponds to
>     the common conventions for the various ISO standards from which the
>     subtags are derived.
>
>     These conventions include:
>
>     o  [ISO639-1] recommends that language codes be written in lowercase
>        ('mn' Mongolian).
>
>     o  [ISO15924] recommends that script codes use lowercase with the
>        initial letter capitalized ('Cyrl' Cyrillic).
>
>     o  [ISO3166-1] recommends that country codes be capitalized ('MN'
>        Mongolia).
>
>
>
>>
>>> * this allows combinations like eng (when no further detail is 
>>> needed) while permitting more precise tags like eng-US or eng-UK.
>>
>> eng-US is not a BCP 47 language tag, since BCP 47 requires the use of 
>> a two-letter code if available, see
>> http://tools.ietf.org/html/bcp47#section-2.2.1
>> " When languages have both an ISO 639-1 two-character code and a three-
>>    character code (assigned by ISO 639-2, ISO 639-3, or ISO 639-5), only
>>    the ISO 639-1 two-character code is defined in the IANA registry."
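
A minimal sketch of that rule, assuming a tiny hand-written mapping in place 
of the IANA Language Subtag Registry. The table and both function names are 
hypothetical, made up for illustration; a real implementation would read the 
registry itself.

```python
# Illustrative excerpt only, NOT the IANA registry: three-letter codes
# that also have an ISO 639-1 two-letter code.
THREE_TO_TWO_LETTER = {
    "eng": "en",
    "spa": "es",
    "deu": "de",
}


def preferred_language_subtag(code: str) -> str:
    """Return the two-letter code when one exists, else the code itself."""
    code = code.lower()
    return THREE_TO_TWO_LETTER.get(code, code)


def fix_primary_language(tag: str) -> str:
    """Replace only the primary language subtag of a tag such as 'eng-US'."""
    head, separator, rest = tag.partition("-")
    return preferred_language_subtag(head) + separator + rest
```

Here fix_primary_language("eng-US") gives "en-US", while codes without a 
two-letter equivalent, such as "yue" or "abz", are returned unchanged.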
>
> You are right.
> Gil
>
>>
>> - Felix
>>
>>> * follow ISO-639-3 to get access to a larger range of values than 
>>> ISO-639-1
>>> * IMHO nobody follows ISO-639-2 nowadays (it was a sort of failed attempt)
>>> * ISO-639-6 is not used
>>>
>>> Hoping that helps,
>>> Gil
>>>
>>>
>>> On 30/01/2014 08:44, Lars Borin wrote:
>>>> Dear all,
>>>>
>>>>>
>>>>>
>>>>>     Other than that I wanted to clarify one issue regarding
>>>>>     language codes in the example.
>>>>>
>>>>>     I have seen that some people (John?) have started to use the
>>>>>     ISO 639-2 codes (e.g. "ENG" for English, "SPA" for Spanish etc.).
>>>>>     I would propose we stick to the two-letter ISO 639-1
>>>>>     codes (e.g. "EN", "ES" etc.). There is no particular reason for
>>>>>     this other than the fact that most people know these codes.
>>>>>
>>>>>     If the argument is recency and reusing the newest standard,
>>>>>     then we would have to go anyway for four letter codes
>>>>>     according to ISO 639-6.
>>>>>
>>>>>
>>>>> In the Open Multilingual Wordnet we use the three-letter codes 
>>>>> because there are people working on languages which do not have 
>>>>> two-letter codes, such as Abui (abz), Minangkabau (min) or 
>>>>> Cantonese (yue). Note that some of these are large language 
>>>>> communities; Minangkabau has around 6 million speakers. I think 
>>>>> this is a strong argument for not going back to the two-letter codes.
>>>>
>>>> I suspect that the three-letter codes in question are intended to 
>>>> be ISO 639-3 (and not 639-2), the use of which is pretty much best 
>>>> practice in linguistics today (even if there is quite a bit of 
>>>> discussion about how well it reflects linguistic descriptive 
>>>> practice and actual reality; see, e.g., 
>>>> <http://dlc.hypotheses.org/610>), because of coverage (not even all 
>>>> the languages of Europe are covered by 639-1, e.g. the two Sorbian 
>>>> languages) and because of granularity: The "language" level of ISO 
>>>> 639-3 (basically that of the Ethnologue) will not be included in 
>>>> 639-6, so there won't be a way of saying "English", since 639-3 
>>>> already provides one, but you will be able to say (or, rather, 
>>>> propose codes for), e.g., "Elizabethan English", "Modern Australian 
>>>> English", etc.
>>>>
>>>> Best
>>>> Lars
>>>>
>>>> -- 
>>>> «Null hull,» sa Harry    | – Bögga? sagði Erlendur. Er það orð? |
>>>> (Jo Nesbø: Kakerlakkene) | (Arnaldur Indriðason: Mýrin)         |
>>>> --
>>>> Se aikainen matohan nokitaan!
>>>> (Reijo Mäki: Uhkapelimerkki)
>>>> ----
>>>> Lars Borin
>>>> Språkbanken • Centre for Language Technology
>>>> Institutionen för svenska språket
>>>> Göteborgs universitet
>>>> Box 200
>>>> SE-405 30 Göteborg
>>>> Sweden
>>>>
>>>> office +46 (0)31 786 4544
>>>> mobile +46 (0)70 747 8386
>>>>
>>>> <http://språkbanken.gu.se/personal/lars/>
>>>
>>
>
Received on Thursday, 30 January 2014 08:46:32 UTC