[Bug 14709] user agent lang tag handling is insufficiently specified

http://www.w3.org/Bugs/Public/show_bug.cgi?id=14709

--- Comment #28 from Glenn Adams <glenn@skynav.com> 2011-11-09 19:36:31 UTC ---
(In reply to comment #27)
> Like Glenn said, there is a question what "null lang subtag" means: It could
> not be equal to the empty string. Let's consider a spelling checker: how should
> it behave in case it saw this:

Presumably your reasoning for why it (the null lang subtag) could not be equal to
the empty string is based on the point that the empty string is not a valid
BCP47 tag. Is this correct?

Looking back at HTML4.0 [1], I see that lang was defined to be an RFC1766
Language-Tag [2], which, to be well formed, must consist of at least one
character (in the Primary-tag) [3][4]. There is no discussion in HTML4.0 or
RFC1766 about a default "unknown" or "undetermined" language.

[1] http://www.w3.org/TR/1998/REC-html40-19980424/
[2] http://www.ietf.org/rfc/rfc1766.txt
[3] http://www.w3.org/TR/1998/REC-html40-19980424/struct/dirlang.html#langcodes
[4] http://www.w3.org/TR/1998/REC-html40-19980424/types.html#h-6.8
[5] http://www.w3.org/TR/1998/REC-html40-19980424/struct/dirlang.html#h-8.1.3

HTML4.0 also defines semantics for inheritance of language [6], wherein the
language that applies to a parent element is inherited by its child elements
unless the child specifies a language attribute.

[6] http://www.w3.org/TR/1998/REC-html40-19980424/struct/dirlang.html#h-8.1.2

HTML4.0 does NOT specify a means for a child to block inheritance except by
specifying a valid RFC1766 language in its lang attribute. That is, HTML4.0
does not define the use of the empty string (or any other value) as a way to
reset the child's language to "unknown" or "undetermined" or "default".

Notwithstanding the above, the language tag "i-default" was registered with
IANA in March 1998 [7], making it a valid language tag that means 'default'
language. This tag is also included in BCP47 as a valid grandfathered tag.

[7] http://www.iana.org/assignments/lang-tags/i-default

Curiously, 'i-default' is defined in terms of the recipient's language
preferences, and not in terms of the language of the message being transmitted:

"It is not a specific language, but rather identifies the condition where the
language preferences of the user cannot be established."

Furthermore, it is required that:

"Messages in Default Language MUST be understandable by an English-speaking
person..."

In essence, 'i-default' is like a weak form of 'en'.

My conclusion is that 'i-default' is NOT the same as stating that the language
of the marked content is unknown or undetermined. So it should not be used for
this purpose.

XML 1.0 1998 [1st Edition] also defines xml:lang [8] in terms of RFC1766, and
does not mention a default or unknown/undetermined language value, and does NOT
specify the use of the empty string as a way of denoting a default or unknown
language value.

[8] http://www.w3.org/TR/1998/REC-xml-19980210#sec-lang-tag

Subsequently, in XML 1.0 2004 [3rd Edition] [9], the use of RFC1766 is updated
to the use of RFC3066 [10] AND the null / empty string is introduced as a legal
value [11]:

"The values of the attribute are language identifiers as defined by [IETF RFC
3066], Tags for the Identification of Languages, or its successor; in addition,
the empty string may be specified."

and

"The intent declared with xml:lang is considered to apply to all attributes and
content of the element where it is specified, unless overridden with an
instance of xml:lang on another element within that content. In particular, the
empty value of xml:lang is used on an element B to override a specification of
xml:lang on an enclosing element A, without specifying another language. Within
B, it is considered that there is no language information available, just as if
xml:lang had not been specified on B or any of its ancestors."

[9] http://www.w3.org/TR/2004/REC-xml-20040204/
[10] http://www.ietf.org/rfc/rfc3066.txt
[11] http://www.w3.org/TR/2004/REC-xml-20040204/#sec-lang-tag

The last paragraph quoted above is expanded in XML 1.0 2006 [4th Edition] [12]
to read as:

"The language specified by xml:lang applies to the element where it is
specified (including the values of its attributes), and to all elements in its
content unless overridden with another instance of xml:lang. In particular, the
empty value of xml:lang is used on an element B to override a specification of
xml:lang on an enclosing element A, without specifying another language. Within
B, it is considered that there is no language information available, just as if
xml:lang had not been specified on B or any of its ancestors. Applications
determine which of an element's attribute values and which parts of its
character content, if any, are treated as language-dependent values described
by xml:lang."

[12] http://www.w3.org/TR/2006/REC-xml-20060816/#sec-lang-tag

This language remains unchanged in the current XML 1.0 2008 [5th Edition] [13].

[13] http://www.w3.org/TR/REC-xml/#sec-lang-tag
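
To make the xml:lang override semantics quoted above concrete, here is a
minimal sketch in Python using the standard library's ElementTree. The function
name effective_langs and the sample document are my own assumptions for
illustration; nothing here comes from the XML specification beyond the
inheritance rule itself. None stands for "no language information available".

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(elem, inherited=None, out=None):
    """Map each element tag to its effective language (hypothetical helper).

    None means "no language information available".
    """
    if out is None:
        out = {}
    declared = elem.get(XML_LANG)
    if declared is None:
        current = inherited   # attribute absent: inherit from the parent
    elif declared == "":
        current = None        # empty value: override without naming a language
    else:
        current = declared    # non-empty value: override with that language
    out[elem.tag] = current
    for child in elem:
        effective_langs(child, current, out)
    return out


# Element "a" plays the enclosing element A, "b" plays element B.
doc = ET.fromstring('<a xml:lang="en"><b xml:lang=""><c/></b><d/></a>')
langs = effective_langs(doc)
```

Here "b" and its child "c" end up with no language information, while "d"
still inherits "en" from "a".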

> One primary language subtags in the language subtag registry that means
> something close to "null", is 'und' (Undtermined). So one option could perhaps
> be to convert illegal primary language subtags to that subtag - 'und'?

To be consistent with XML 1.0 3rd Edition and later, we need to use the empty
(null) string to both (1) specify the absence of language information and (2)
override inheritance of language information from the parent.

For invalid language tags, I would now conclude that they should receive the
same treatment, i.e., be treated as if the empty string had been specified.

Note that a language tag may be valid according to BCP47 but not listed in the
IANA registry. This is due to the possible use of privateuse subtags.
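
As a rough illustration of checking a tag without consulting the registry, the
following Python sketch tests only the syntactic shape of a tag, loosely
following the BCP47 langtag grammar. It is an approximation under stated
assumptions: it lists only the one grandfathered tag discussed in this thread,
it does not reject duplicate variants or extension singletons, and it does not
look anything up in the IANA registry, so it cannot establish validity in the
BCP47 sense. The names are hypothetical.

```python
import re

# Assumption: only the grandfathered tag discussed in this thread is listed;
# the full grandfathered list in BCP47 is longer.
GRANDFATHERED_SAMPLE = {"i-default"}

# A tag consisting solely of private-use subtags, e.g. "x-whatever".
PRIVATE_USE_ONLY = re.compile(r"x(?:-[a-z0-9]{1,8})+")

# Simplified langtag shape; subtags are matched after lowercasing.
LANGTAG = re.compile(
    r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})    # language, optional extlang
    (?:-[a-z]{4})?                                 # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                    # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*       # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*           # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                     # private use
    """,
    re.VERBOSE,
)


def is_well_formed(tag: str) -> bool:
    """Return True if the tag has a plausible BCP47 shape (syntax only)."""
    t = tag.lower()
    if t in GRANDFATHERED_SAMPLE:
        return True
    if PRIVATE_USE_ONLY.fullmatch(t):
        return True
    return LANGTAG.fullmatch(t) is not None
```

Under this sketch "en-x-foo" passes because of its private-use subtag, with no
registry lookup involved. Note that "xyzzy" also passes, since a five-letter
primary subtag is syntactically allowed; rejecting it would require the
registry lookup that this sketch deliberately omits.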

So given the above, I would now propose the language of HTML5 be changed as
follows:

In 3.2.3.3

In 1st paragraph, remove last sentence (this gets moved to the 13th paragraph,
described below):

"Setting the attribute to the empty string indicates that the primary language
is unknown."

In 11th paragraph, change

"If the resulting value is not a recognized language tag, then it must be
treated as an unknown language having the given language tag, distinct from all
other languages. For the purposes of round-tripping or communicating with other
services that expect language tags, user agents should pass unknown language
tags through unmodified."

to read as:

"If the resulting value is non-empty and is not valid according to BCP47
§2.2.9, then it must be treated as if the empty string had been specified."

Remove 12th paragraph starting with "Thus, for instance, an element with
lang="xyzzy" ..."

In 13th paragraph, change:

"If the resulting value is the empty string, then it must be interpreted as
meaning that the language of the node is explicitly unknown."

to read:

"If the resulting value is the empty string, then it must be interpreted as
meaning no language information is available, just as if the lang attribute had
not been specified on the element or any of its ancestors."
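
The combined effect of the proposed 11th and 13th paragraphs can be sketched as
follows. This is only an illustration of the proposed wording, not of any user
agent's behaviour. The function resolve_language, the list-based model of the
ancestor chain, and the toy validity predicate are all assumptions of mine; a
real check would follow BCP47.

```python
from typing import Callable, Optional, Sequence


def resolve_language(
    chain: Sequence[Optional[str]],
    is_valid: Callable[[str], bool],
) -> Optional[str]:
    """Resolve the language of the last node in an ancestor chain.

    chain holds the lang attribute values from the root down to the node;
    None means the attribute is absent. The result None means "no language
    information is available".
    """
    # The nearest element (starting at the node itself) with a lang
    # attribute decides the outcome.
    for value in reversed(chain):
        if value is None:
            continue              # attribute absent: keep looking upward
        if value == "" or not is_valid(value):
            return None           # empty or invalid: no language information
        return value
    return None                   # no lang attribute anywhere in the chain


def toy_is_valid(tag: str) -> bool:
    """Stand-in validity check (hypothetical); accepts a tiny fixed set."""
    return tag.lower() in {"en", "fr", "de"}


inherited = resolve_language(["en", None, None], toy_is_valid)
reset = resolve_language(["en", "", None], toy_is_valid)
invalid = resolve_language(["en", "not a tag"], toy_is_valid)
redeclared = resolve_language(["en", "", "fr"], toy_is_valid)
```

In this model an invalid value behaves exactly like the empty string: it blocks
inheritance from the ancestors and leaves the node with no language information.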


Received on Wednesday, 9 November 2011 19:36:40 UTC