
RE: declaring language in html/xhtml

From: Jon Hanna <jon@hackcraft.net>
Date: Sat, 11 Dec 2004 15:31:05 -0000
To: "'Tex Texin'" <tex@i18nguy.com>, <www-international@w3.org>
Message-Id: <20041211153118.40B262DC9408D@postie.hosting365.ie>

> I want to take issue with the first point though. I have heard the
> recommendation for "ja" rather than ja-JP before as well.
> I dislike it on several counts:
> 
> 1) For languages in general, it is difficult for most of us to know
> whether a language is spoken in different places with variations, or
> whether the variations are significant enough to require a regional
> distinction. So until someone publishes a guide listing languages and
> whether or not they require distinction by region, so that we have a
> reference, for many of the folks who need to assign the language tag,
> it is just a guess.

If you don't know then it's probably better to guess that only the ISO 639
portion is relevant or applicable. Consider someone who didn't know much
about English marking up text written by us. Having only limited knowledge of
the language they wouldn't be able to identify dialects, never mind
determine whether those dialects were linked to a particular country. They
might guess based on the country the person in question was in but that can
be a poor indicator (if I moved to Canada tomorrow my speech and writing
would take some time to move away from en-IE). As such they would be best to
use "en".

> 2) Being more specific when labeling content does no harm. (Assuming
> hierarchical fallbacks.)

Hierarchical fallbacks are problematic in some cases.
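
To make "hierarchical fallback" concrete, here is a minimal Python sketch of
one common scheme: strip subtags from the right of the requested tag until
something matches. The function names fallback_chain and lookup are made up
for illustration and are not taken from any specification or library; real
matching rules have more cases than this.

```python
def fallback_chain(tag):
    """Yield the tag itself, then each shorter prefix (ja-JP, then ja)."""
    parts = tag.split("-")
    while parts:
        yield "-".join(parts)
        parts.pop()  # drop the rightmost subtag and try again


def lookup(requested, available):
    """Return the first available tag reached by truncating `requested`.

    Comparison is case-insensitive; the available tag is returned as given.
    Returns None when no prefix of the requested tag is available.
    """
    by_lower = {t.lower(): t for t in available}
    for candidate in fallback_chain(requested.lower()):
        if candidate in by_lower:
            return by_lower[candidate]
    return None


if __name__ == "__main__":
    print(lookup("ja-JP", ["ja", "en"]))        # request is more specific
    print(lookup("ja", ["ja-JP"]))              # request is less specific
    print(lookup("en-IE", ["en-GB", "en-US"]))  # sibling regions only
```

With this scheme a request for ja-JP falls back to content tagged ja, but a
request for plain ja never finds content tagged only ja-JP, and a request for
en-IE does not find en-GB. That asymmetry is one example (my assumption, not
necessarily the cases meant above) of where truncation-only fallback can give
surprising results.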

> 3) Being less specific introduces the risk of ambiguity, which may
> cause problems.

This is only the case where ja-JP does really differentiate from, say, a
hypothetical dialect of Japanese spoken elsewhere. Contra this, if you don't
*know* that a more specific tag is appropriate then you may be in fact
incorrect in using the more specific term (i.e. if you marked my writing as
en-GB to be more specific than en based on knowledge of differences in en-US
and en-GB that would be incorrect, and you would have been better being less
precise).

> 4) Being less specific introduces the risk that even if the language
> alone is adequate tagging today, it may not be tomorrow.
> Language, legislation, external influences, and many other factors
> can cause a region's speakers to change.
> 
> Supposing Japan legislates a spelling or sorting change, or a
> simplification of the writing system, as has occurred with several
> languages in the past century.
> The speakers outside of Japan may not adopt the changes.
> Consider modern and traditional Spanish, simplified and traditional
> Chinese.
> According to the Ethnologue:
> <Japanese is> spoken in 26 other countries including American Samoa,
> Argentina, Australia, Belize, Brazil, Canada, Dominican Republic,
> Germany, Guam, Mexico, Micronesia, Mongolia, New Zealand, Northern
> Mariana Islands, Palau, Panama, Paraguay, Peru, Philippines.
> There are also Japanese speakers in Taiwan.
> 
> At some point in the future, there may be value in distinguishing
> Japanese from one or the other region.
> If that occurs, then all of the data marked with just "ja" becomes
> ambiguous.

In such a case the data *should* be ambiguous. We don't know where the
language used fits into the range of Japanese dialects that may or may not
exist at any given point in the future. If someone in the year 3043 comes
across the data they're just going to have to work it out for themselves;
there's nothing we can do to help. Indeed we could be marking data as ja-JP
that is closer to ja-BR than ja-JP at some point in the future where those
two have become distinct dialects.

> 5) It is not clear to me that there is any benefit to using a shorter
> language tag.
> The recommendation comes from a spirit of keeping it as simple as
> possible. In general, I support KISS.
> But this is not simplifying an algorithm, this is subtracting
> information that may be useful.

It's subtracting information that simply isn't there. ja-JP means "the
dialect of Japanese that is spoken in Japan and which differs from other
dialects spoken in other countries". Unless those other dialects exist (I've
been told that Japanese spoken outside of Japan is too close to how it is
spoken in Japan to be considered a dialect, though I admit I've no knowledge
of this myself) then there simply isn't any such language as ja-JP.

> 6) I realize the language tag may be supplied by the content author.
> I am sure to get a comment to the effect that even if I, as a web
> administrator, or localization manager, or in some other role, do not
> know whether a language has variations, the author will, since they
> are familiar with the language.
> Well, I do not buy that.
> I do believe they know where they were trained and can supply a
> region tag.

No, I don't buy that either. Tags should not be UI features.

> Just to be clear, I am not arguing that Japanese is different outside
> of Japan.

If someone were to turn around and say that it actually is, I wouldn't be
amazed.

> I am arguing that whether or not it is different somewhere in the
> world should not be required knowledge when tagging content. The
> tagger should only need to know whether their language is similar
> enough to Japan's Japanese to use a JP region tag, or another one.

Content providers should have been told "write what you know" a long time
ago. "Tag what you know" isn't that much further :)

Regards,
Jon Hanna
Work: <http://www.selkieweb.com/>
Play: <http://www.hackcraft.net/>
Chat: <irc://irc.freenode.net/selkie> 
Received on Saturday, 11 December 2004 15:31:25 GMT
