
Re: declaring language in html/xhtml

From: Tex Texin <tex@i18nguy.com>
Date: Sun, 12 Dec 2004 18:23:07 -0800
Message-ID: <41BCFD0B.85E1BE85@i18nguy.com>
To: Jon Hanna <jon@hackcraft.net>
CC: www-international@w3.org

Hi again,

1) My presumption is that the author knows the origin of their language, and
that if the author doesn't label the text then the tagger can ask the author.
If that's not possible, and if there isn't someone knowledgeable available to
identify the language, then I would agree with using the less specific tag.
I am not advocating being more specific than you know. Most authors know the
origins of their language.

2) Yes, hierarchical fallbacks can be problematic. It's a shame they work
differently between markup languages, Java and Unix. It's also possible that
the parent can be unsuitable as a replacement for the more specific value.
That said, although fallback is imperfect, I don't see that being more specific
is harmful when the values are equivalent, as is the case with ja and ja-JP.
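
To make the fallback idea concrete, here is a minimal sketch of one simple
scheme: drop the rightmost subtag until something matches. This is only an
illustration. The function names are hypothetical, and it does not reproduce
the exact rules of any particular markup language, Java or Unix, which as
noted above differ from one another.

```python
def fallback_chain(tag):
    """Return the tag followed by its progressively less specific parents."""
    subtags = tag.split("-")
    return ["-".join(subtags[:i]) for i in range(len(subtags), 0, -1)]


def lookup(requested, available):
    """Pick the first available tag found while walking the fallback chain.

    Comparison is case-insensitive. Returns None when nothing matches.
    """
    by_lower = {a.lower(): a for a in available}
    for candidate in fallback_chain(requested):
        hit = by_lower.get(candidate.lower())
        if hit is not None:
            return hit
    return None


if __name__ == "__main__":
    print(fallback_chain("ja-JP"))
    print(lookup("ja-JP", ["en", "ja"]))
    print(lookup("en-GB", ["en-US", "fr"]))
```

Under this scheme a request for ja-JP is satisfied by content tagged ja, so
the extra subtag costs nothing. A request for en-GB is not satisfied by en-US,
which is the kind of case where the parent or a sibling may be unsuitable.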

3) You suggest that if you don't know, then being too specific might cause
problems.
Perhaps. It works both ways. Being less specific can also be problematic: it
makes en-GB, en-US, en-ZA, en-SG, etc. all equivalent, which makes an
appropriate selection impossible.
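
As a rough illustration of the selection problem, the sketch below uses
prefix matching in the style of RFC 3066 language-ranges: a range matches a
tag if the two are equal, or if the range is a prefix of the tag followed by
"-". The file names and the helper names are made up for the example.

```python
def matches(language_range, tag):
    """True if the range equals the tag or is a prefix of it ending at '-'."""
    r, t = language_range.lower(), tag.lower()
    return t == r or t.startswith(r + "-")


def select(language_range, tagged_items):
    """Return the names of the items whose tag matches the requested range."""
    return [name for name, tag in tagged_items if matches(language_range, tag)]


# Hypothetical content, first tagged with a region and then without one.
specific = [("colour.html", "en-GB"), ("color.html", "en-US")]
generic = [("colour.html", "en"), ("color.html", "en")]

if __name__ == "__main__":
    print(select("en-GB", specific))
    print(select("en", specific))
    print(select("en-GB", generic))
    print(select("en", generic))
```

With the region-tagged items, a request for en-GB picks out exactly one
document. With everything tagged en, the same request matches nothing, and a
request for en returns both documents with no way to choose between them.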

Since most of us don't know which tags should be just language and which should
be language-country, maybe there should be a list posted as to which are which.
Most of our disagreement is about what to do when it is unclear.
I would welcome some clarification, and the issue then goes away. If we have a
list that identifies Japanese, Greek, Hebrew and others as not requiring 2
levels of identification, and identifies those which do (perhaps also
distinguishing regions of note from others which can be lumped together), then
we don't need to debate more general guidelines.
(Yes, I know such a list won't be perfect either. But it would probably cover
the most frequently used cases.)

4) Yes, futures can go either way, but with the additional specificity and the
approximate date of origin, you have enough info to make an educated guess.

5) You say:
> ja-JP means "the
> dialect of Japanese that is spoken in Japan and which differs from other
> dialects spoken in other countries"

You will have to point me at where the standard says it must differ.
As far as I know it just says Japanese as spoken in Japan. It doesn't require
being different from others.

6) I am sorry, I don't understand your reply to 6. I wasn't discussing UI, just
language tagging.

I think probably putting our energies into this debate is going to be less
productive than if we (and others) were to create a list suggesting which
languages are adequately described by a simple language-only tag, and which
require more.

If I am sent suggestions (privately preferred), I will create a page listing
them.
I'll post the same suggestion to the ietf-lang list where they are discussing
the next update of RFC 3066.
tex

Jon Hanna wrote:
> 
> > I want to take issue with the first point though. I have heard the
> > recommendation for "ja" rather than ja-JP before as well.
> > I dislike it on several counts:
> >
> > 1) For languages in general, it is difficult for most of us to know
> > whether a language is spoken in different places with variations, or
> > whether the variations are significant enough to require a regional
> > distinction. So until someone publishes a guide listing languages and
> > whether or not they require distinction by region, so that we have a
> > reference, for many of the folks who need to assign the language tag, it
> > is just a guess.
> 
> If you don't know then it's probably better to guess that only the ISO 639
> portion is relevant or applicable. Consider someone who didn't know much
> about English marking up text written by us. Having only limited knowledge of
> the language they wouldn't be able to identify dialects, never mind
> determine whether those dialects were linked to a particular country. They
> might guess based on the country the person in question was in but that can
> be a poor indicator (if I moved to Canada tomorrow my speech and writing
> would take some time to move away from en-IE). As such they would be best to
> use "en".
> 
> > 2) Being more specific when labeling content does no harm. (Assuming
> > hierarchical fallbacks.)
> 
> Hierarchical fallbacks are problematic in some cases.
> 
> > 3) Being less specific introduces the risk of ambiguity, which may cause
> > problems.
> 
> This is only the case where ja-JP does really differentiate from, say, a
> hypothetical dialect of Japanese spoken elsewhere. Contra this, if you don't
> *know* that a more specific tag is appropriate then you may be in fact
> incorrect in using the more specific term (i.e. if you marked my writing as
> en-GB to be more specific than en based on knowledge of differences in en-US
> and en-GB that would be incorrect, and you would have been better being less
> precise).
> 
> > 4) Being less specific introduces the risk that even if the language alone
> > is adequate tagging today, it may not be tomorrow.
> > Language, legislation, external influences, and many other factors can
> > cause a region's speakers to change.
> >
> > Supposing Japan legislates a spelling or sorting change, or a
> > simplification of the writing system, as has occurred with several
> > languages in the past century.
> > The speakers outside of Japan may not adopt the changes.
> > Consider modern and traditional Spanish, simplified and traditional Chinese.
> > According to the Ethnologue:
> > <Japanese is> spoken in 26 other countries including American Samoa,
> > Argentina, Australia, Belize, Brazil, Canada, Dominican Republic,
> > Germany, Guam, Mexico, Micronesia, Mongolia, New Zealand, Northern
> > Mariana Islands, Palau, Panama, Paraguay, Peru, Philippines.
> > There are also Japanese speakers in Taiwan.
> >
> > At some point in the future, there may be value in distinguishing Japanese
> > from one or the other region.
> > If that occurs, then all of the data marked with just "ja" becomes ambiguous.
> 
> In such a case the data *should* be ambiguous. We don't know where the
> language used fits into the range of Japanese dialects that may or may not
> exist at any given point in the future. If someone in the year 3043 comes
> across the data they're just going to have to work it out for themselves,
> there's nothing we can do to help. Indeed we could be marking data as ja-JP
> that is closer to ja-BR than ja-JP at some point in the future where those
> two have become distinct dialects.
> 
> > 5) It is not clear to me that there is any benefit to using a shorter
> > language tag.
> > The recommendation comes from a spirit of keeping it as simple as
> > possible. In general, I support KISS.
> > But this is not simplifying an algorithm, this is subtracting information
> > that may be useful.
> 
> It's subtracting information that simply isn't there. ja-JP means "the
> dialect of Japanese that is spoken in Japan and which differs from other
> dialects spoken in other countries". Unless those other dialects exist (I've
> been told that Japanese spoken outside of Japan is too close to how it is
> spoken in Japan to be considered a dialect, though I admit I've no knowledge
> of this myself) then there simply isn't any such language as ja-JP.
> 
> > 6) I realize the language tag may be supplied by the content author. I am
> > sure to get a comment to the effect that although, as a web administrator,
> > or localization manager, or in some other role, I do not know whether a
> > language has variations, the author will, since they are familiar with the
> > language.
> > Well, I do not buy that.
> > I do believe they know where they were trained and can supply a region tag.
> 
> No, I don't buy that either. Tags should not be UI features.
> 
> > Just to be clear, I am not arguing that Japanese is different outside of
> > Japan.
> 
> If someone was to turn around and say that actually it is I wouldn't be
> amazed.
> 
> > I am arguing that whether or not it is different somewhere in the world
> > should not be required knowledge when tagging content. The tagger should
> > only need to know whether their language is similar enough to Japan's
> > Japanese to use a JP region tag, or another one.
> 
> Content providers should have been told "write what you know" a long time
> ago. "Tag what you know" isn't that much further :)
> 
> Regards,
> Jon Hanna
> Work: <http://www.selkieweb.com/>
> Play: <http://www.hackcraft.net/>
> Chat: <irc://irc.freenode.net/selkie>

-- 
-------------------------------------------------------------
Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
                         
XenCraft		            http://www.XenCraft.com
Making e-Business Work Around the World
-------------------------------------------------------------
Received on Monday, 13 December 2004 02:23:21 GMT
