Re: declaring language in html/xhtml

From: Tex Texin <tex@i18nguy.com>
Date: Sat, 11 Dec 2004 04:24:55 -0800
Message-ID: <41BAE717.1BE4BEE2@i18nguy.com>
To: Jon Hanna <jon@hackcraft.net>
CC: 'Alan Pierce' <apierce411@hotmail.com>, www-international@w3.org

Hi Jon,
Excellent answer.

I want to take issue with the first point, though. I have heard the
recommendation to use "ja" rather than "ja-JP" before as well.
I dislike it on several counts:

1) For languages in general, it is difficult for most of us to know whether a
language is spoken in different places with variations, or whether the
variations are significant enough to require a regional distinction. Until
someone publishes a reference guide listing languages and whether or not they
require distinction by region, many of the folks who need to assign the
language tag are just guessing.

2) Being more specific when labeling content does no harm (assuming
hierarchical fallbacks).

3) Being less specific introduces the risk of ambiguity, which may cause

4) Being less specific introduces the risk that even if the language alone is
adequate tagging today, it may not be tomorrow.
Language, legislation, external influences, and many other factors can cause
the language used by a region's speakers to change.

Suppose Japan legislates a spelling or sorting change, or a simplification of
the writing system, as has occurred with several languages in the past century.
The speakers outside of Japan may not adopt the changes.
Consider modern and traditional Spanish, simplified and traditional Chinese.
According to the Ethnologue:
<Japanese is> spoken in 26 other countries including American Samoa, Argentina,
Australia, Belize, Brazil, Canada, Dominican Republic, Germany, Guam, Mexico,
Micronesia, Mongolia, New Zealand, Northern Mariana Islands, Palau, Panama,
Paraguay, Peru, Philippines.
There are also Japanese speakers in Taiwan.

At some point in the future, there may be value in distinguishing the Japanese
of one region from that of another.
If that occurs, then all of the data marked with just "ja" becomes ambiguous.

5) It is not clear to me that there is any benefit to using a shorter language
tag.
The recommendation comes from a spirit of keeping it as simple as possible. In
general, I support KISS.
But this is not simplifying an algorithm; it is subtracting information that
may be useful.

6) I realize the language tag may be supplied by the content author. I am sure
to get a comment to the effect that even if I, as a web administrator,
localization manager, or in some other role, do not know whether a language
has variations, the author will, since they are familiar with the language.
Well, I do not buy that.
I do believe authors know where they were trained and can supply a region tag.

Just to be clear, I am not arguing that Japanese is different outside of Japan.
I am arguing that whether or not it is different somewhere in the world should
not be required knowledge when tagging content. The tagger should only need to
know whether their language is similar enough to Japan's Japanese to use a JP
region tag, or another one.
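
To make the "hierarchical fallbacks" assumption in point 2 concrete, here is a
minimal sketch in Python. It assumes fallback works by simple truncation of
hyphen-separated subtags; the function names are made up for illustration and
are not taken from any library or specification.

```python
def fallback_chain(tag):
    """Return the tag followed by progressively shorter prefixes.

    Assumption: fallback is plain truncation at hyphen boundaries.
    """
    subtags = tag.split("-")
    return ["-".join(subtags[:i]) for i in range(len(subtags), 0, -1)]


def matches(requested, content_tag):
    """True if content labeled content_tag satisfies a request for requested.

    The comparison is case-insensitive. A request matches when it equals the
    content's tag or any shorter prefix of it.
    """
    chain = [t.lower() for t in fallback_chain(content_tag)]
    return requested.lower() in chain


if __name__ == "__main__":
    print(fallback_chain("ja-JP"))   # ['ja-JP', 'ja']
    print(matches("ja", "ja-JP"))    # True: specific label still serves "ja"
    print(matches("ja-JP", "ja"))    # False: bare "ja" cannot serve "ja-JP"
```

Under this assumption, content labeled "ja-JP" still satisfies a request for
"ja", while content labeled only "ja" cannot satisfy a request for "ja-JP".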

I know I am going against the grain here...


Jon Hanna wrote:
> > Does it make any practical difference to serve html with the html tag
> > marked-up as xhtml, like:
> > <html lang="ja-JP" xml:lang="ja_JP"
> > xmlns="http://www.w3.org/1999/xhtml">
> >
> > as opposed to simply
> > <html lang="ja-JP"> ?
> There's a few things here.
> 1. ja-JP means the dialect of Japanese spoken in Japan as opposed to the 1
> or more dialects spoken elsewhere. I've been told that there isn't any other
> country with a different form of Japanese, so the correct language tag is
> just "ja" unlike, for example British English "en-GB" which does benefit
> from the second part of the tag as it differentiates it from en-IE, en-US
> etc. (I don't know much about Japanese, but I've seen ja-JP used as an
> example of just this sort of mistake by those who do know more than I).
> 2. ja_JP is incorrect syntax, both lang and xml:lang take RFC 3066 tags so
> there are no underscores (a typo?).
> 3. The lang attribute is only in XHTML for backwards compatibility, so that
> when an old HTML tool that doesn't grok XHTML sees the XHTML it will act as
> if it is HTML and be able to determine the language. Contra this
> general-purpose XML tools that don't know anything specific about XHTML (and
> the ability to use such tools is the main practical advantage in using XHTML
> rather than HTML) will understand the xml:lang, but not the lang. As such
> xml:lang is the one that you must use, lang is the one that you can use as
> well.
> <html lang="ja">
> <!-- HTML 4.01 or earlier, Japanese -->
> <blah xml:lang="ja">
> <!-- Some form of XML, Japanese -->
> <html xml:lang="ja">
> <!-- Some form of XML, Japanese (Not XHTML, as there's no namespace) -->
> <html xml:lang="ja" xmlns="http://www.w3.org/1999/xhtml">
> <!-- XHTML, Japanese -->
> <html lang="ja" xmlns="http://www.w3.org/1999/xhtml">
> <!-- XHTML, Japanese, but general XML tools won't realise this. -->
> <html xml:lang="ja" lang="ja" xmlns="http://www.w3.org/1999/xhtml">
> <!-- XHTML, Japanese, backwards compatible with old HTML user-agents -->
> <html xml:lang="ja" lang="en" xmlns="http://www.w3.org/1999/xhtml">
> <!-- Obviously a bug, but the way it's interpreted is worth looking at.
> An XML tool will see it as Japanese.
> An HTML tool will see it as English.
> An XHTML tool will see xml:lang as over-riding lang, since lang is just for
> backwards-compatibility, and hence see it as being Japanese -->
> In all I'd recommend you keep using the fuller form until the general level
> of tool support means you can drop lang and just use xml:lang.
> Regards,
> Jon Hanna
> Work: <http://www.selkieweb.com/>
> Play: <http://www.hackcraft.net/>
> Chat: <irc://irc.freenode.net/selkie>
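
Jon's last example (xml:lang="ja" with lang="en") can be sketched as a small
Python function. This is only an illustration of the three interpretations he
describes, not a real parser: the function name, the "tool" labels, and the use
of a plain dict for the attributes are all assumptions.

```python
def effective_language(attrs, tool):
    """Return the language a given kind of tool would see on an element.

    attrs: dict of attribute name to value, e.g. {"xml:lang": "ja"}.
    tool:  one of "html", "xml", "xhtml" (hypothetical labels).
    """
    if tool == "html":
        # An old HTML tool only knows the lang attribute.
        return attrs.get("lang")
    if tool == "xml":
        # A general-purpose XML tool only knows xml:lang.
        return attrs.get("xml:lang")
    if tool == "xhtml":
        # An XHTML-aware tool lets xml:lang override lang.
        return attrs.get("xml:lang") or attrs.get("lang")
    raise ValueError("unknown tool kind: %r" % tool)


if __name__ == "__main__":
    conflicting = {"xml:lang": "ja", "lang": "en"}
    for kind in ("html", "xml", "xhtml"):
        print(kind, effective_language(conflicting, kind))
```

With the conflicting attributes, the sketch reports "en" for the HTML tool and
"ja" for the XML and XHTML tools, matching Jon's description.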

Tex Texin   cell: +1 781 789 1898   mailto:Tex@XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Received on Saturday, 11 December 2004 12:25:03 UTC
