Re: Updated article: Two-letter or three-letter language codes

1. Details.

> > On the other hand, in the list, there are a few items (such as en-us,
> > pt-br) that look perfectly fine. What was wrong with them?
>
> Capitalization.


No. (trying for equal brevity)

More completely:

102     0.015999%       en-us.
122     0.010068%       en-us

As I said in my original message, the second one has a space at the end. The
first one has a period at the end. So both are ill-formed. The same is true
of pt-br (extra space at end).
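Since a trailing space is invisible in a listing like the one above, a
minimal sketch of how such values could be displayed unambiguously may help.
The function name describe_trailing is hypothetical, and the three sample
strings are simply the forms discussed above, not additional data.

```python
def describe_trailing(value: str) -> str:
    """Return the repr of a logged value plus a note on any stray trailing character."""
    shown = repr(value)  # repr() keeps a trailing space visible inside the quotes
    if value != value.rstrip():
        return shown + ": trailing whitespace"
    if value and not value[-1].isalnum():
        return shown + ": trailing " + repr(value[-1])
    return shown + ": no stray trailing character"


for raw in ("en-us.", "en-us ", "en-us"):
    print(describe_trailing(raw))
```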

2. Correction.

A correction: these are actually Accept-Language values, not documents -- we
get different results looking at documents. A very interesting point,
however, is that the errors here could be corrected if browsers checked for
well-formedness, or at least partial well-formedness, when allowing the user
to pick his/her browser's language. That would eliminate a lot of cruft.

3. Guidance for User Agents. This raises a point we should probably have
language in 4646bis for. Here's rough text for it; I anticipate that this
will generate some discussion ;-)

When a user agent, such as a browser, allows users to enter a language tag
by typing, the results SHOULD be checked for well-formedness. If the user
agent is not regularly updated to the latest registry, it SHOULD NOT require
validity, because that could exclude current, valid language tags. It is
recommended, however, that the user be notified that the language tag may
not be valid.
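
As a rough illustration of the draft text above, here is a sketch of how a
user agent might treat a typed tag: reject it when it is not well-formed, but
only warn when it cannot be confirmed against an older registry copy. All
names here (LOOSE_SHAPE, Decision, check_typed_tag, registry_snapshot) are
hypothetical, the shape check is a deliberately loose stand-in for a real
well-formedness test, and treating validity as simple subtag membership is a
simplifying assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical, deliberately loose stand-in for a real well-formedness check:
# hyphen-separated ASCII alphanumeric subtags of 1 to 8 characters each.
LOOSE_SHAPE = re.compile(r"[A-Za-z0-9]{1,8}(?:-[A-Za-z0-9]{1,8})*")


@dataclass
class Decision:
    accepted: bool
    warning: Optional[str] = None


def check_typed_tag(tag: str, registry_snapshot: Set[str]) -> Decision:
    # SHOULD: reject input that is not well-formed.
    if LOOSE_SHAPE.fullmatch(tag) is None:
        return Decision(False, "not a well-formed language tag")
    # SHOULD NOT require validity against a possibly stale snapshot:
    # unknown subtags produce a warning, not a rejection.
    unknown = []
    for subtag in tag.lower().split("-"):
        if subtag == "x":  # private-use subtags are never registered
            break
        if subtag not in registry_snapshot:
            unknown.append(subtag)
    if unknown:
        return Decision(True, "may not be valid; unrecognized: " + ", ".join(unknown))
    return Decision(True)
```

With a snapshot containing only en, us, pt and br, the call
check_typed_tag("en-us ", snapshot) is rejected because of the trailing space,
while check_typed_tag("en-zz", snapshot) is accepted with a warning.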

4. Basic Well-Formedness. We may also want to have the notion of basic
well-formedness, which is that part of validity that can be checked with a
regular expression. The difference is that basic well-formedness doesn't
check for multiple singleton extensions. The value of doing this is that (a)
it covers 99.999...% of the value of a well-formedness check, and (b) it is
a much easier sell to implementers that all they need is a simple regex
check.
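
To make the regex idea concrete, here is a sketch of what such a check could
look like. The pattern loosely follows the RFC 4646 langtag grammar but omits
grandfathered tags, so it is an approximation, not the normative syntax. The
names BASIC_WELL_FORMED, is_basic_well_formed and has_repeated_singleton are
hypothetical. The second function shows the part left out of the regex,
reading "multiple singleton extensions" as the same singleton appearing more
than once, which is an interpretation of the text above.

```python
import re

# Approximation of the RFC 4646 langtag grammar; grandfathered tags omitted.
BASIC_WELL_FORMED = re.compile(
    r"""
    (?:
        (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})   # language, optional extlang
        (?:-[a-z]{4})?                                # script
        (?:-(?:[a-z]{2}|[0-9]{3}))?                   # region
        (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*      # variants
        (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*          # extensions
        (?:-x(?:-[a-z0-9]{1,8})+)?                    # private use
    |
        x(?:-[a-z0-9]{1,8})+                          # private-use-only tag
    )
    """,
    re.IGNORECASE | re.VERBOSE,
)


def is_basic_well_formed(tag: str) -> bool:
    # fullmatch() rejects any leftover character, including a trailing
    # space, period, or newline.
    return BASIC_WELL_FORMED.fullmatch(tag) is not None


def has_repeated_singleton(tag: str) -> bool:
    # The check the regex leaves out: the same extension singleton twice.
    seen = set()
    for index, subtag in enumerate(tag.lower().split("-")):
        if subtag == "x":  # everything after x is private use
            break
        if index > 0 and len(subtag) == 1:
            if subtag in seen:
                return True
            seen.add(subtag)
    return False
```

Under these assumptions, "en-us" and "pt-br" pass the basic check, "en-us."
and "en-us " fail it, and "en-a-bbb-a-ccc" passes the regex but is caught by
has_repeated_singleton.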

Mark

Received on Sunday, 24 September 2006 20:15:10 UTC