Re: XML Core WG needs input on xml:lang=""

At 08:46 AM 2002-08-02, John Cowan wrote:

>The W3C XML Core WG has decided to allow the value of xml:lang, the
>attribute for indicating the natural language of character data, to
>be an empty string in order to allow the explicit expression of
>language-less text inside language-marked text.  Here's an example:
>
><p xml:lang="en">
>  Here is an example of some C code:
>  <pre xml:lang="">
>     #include "stdio.h"
>     main() {printf("Hello world!");}
>  </pre>
></p>
>
>By the present rules, there is no way to express the fact that the
>content of the pre element is not in English.  (Computer languages are out
>of scope for RFC 3066 and have no codes.)
>
>However, the WG is divided on the question of whether to issue an
>erratum to XML 1.0 or to make this provision part of XML 1.1.
>
>Argument for XML 1.1:  It is a new feature and as such belongs in XML 1.1,
>which we are conveniently issuing shortly anyway.
>
>Argument for erratum:  It is just a single new allowed value for an attribute
>that already got a whole lot of new values when we upgraded (by existing
>erratum E11) from the obsolete RFC 1766 to the current RFC 3066.
>For example, "haw" was an illegal tag under 1766, but refers to the
>Hawai'ian language now.
>
>Note: The XML Schema Datatypes document still references the obsolete RFC,
>but defers to XML 1.0 2e for the exact rules, so an erratum would immediately
>allow the empty string in objects of type xsd:language; an XML 1.1
>change would not immediately allow it.
>
>Note: Any application that processes xml:lang has to already be prepared
>for thousands of legal values, most of which it will not understand.
>For example, de-jp is legal, symbolizing the variety of German spoken and
>written in Japan, whatever that might be.
>
>Note: The existing code "und" is not synonymous with the proposed use of the
>empty string.  The "und" code means that the text is in some natural language,
>but we don't know which one; the empty string means that the text is not
>in a natural language.

This assertion is fatuous: the distinction it draws is unenforceably vague.

The 'und' mark at least is well posed, if it means "one of the defined
language labels applies, but we don't know which."  This is a union type.

Distinguishing between 

a) a natural language for which there is no label registered

b) "not a natural language"

has no portable definition across the different agents that apply 'lang'
attribute values, and so those agents should not be presumed to know it.

It would be fine to have a 'noneOfTheAbove' value for the 'lang' attribute.

However, for practical purposes a 'nil' on 'lang' inside a natural-language
context will be sufficient to stop the processor from applying the rules
of the natural language in the enclosing scope.
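
A minimal sketch of that practical behaviour, written in Python with the
standard xml.etree.ElementTree module purely for illustration.  The helper
name effective_langs, the sample document, and the use of None to mean "no
natural-language rules apply" are assumptions of this sketch, not anything
defined by XML 1.0 or by the WG's proposal:

```python
import xml.etree.ElementTree as ET

# ElementTree reports the predeclared xml: prefix by its namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(root):
    """Return (tag, language) pairs in document order.

    A missing xml:lang inherits from the parent; an empty xml:lang
    cancels the inherited language (represented here as None).
    """
    result = []

    def walk(elem, inherited):
        value = elem.get(XML_LANG)
        if value is None:
            lang = inherited      # no attribute: enclosing scope applies
        elif value == "":
            lang = None           # empty value: drop the enclosing language
        else:
            lang = value          # explicit tag overrides the scope
        result.append((elem.tag, lang))
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return result


# Hypothetical sample document, modelled on the quoted example.
doc = ET.fromstring(
    '<p xml:lang="en">Here is some C code:'
    '<pre xml:lang="">int x;</pre><em>really</em></p>'
)
langs = effective_langs(doc)
```

Note that the sketch only says "stop applying the enclosing language's
rules"; it makes no claim about whether the content is in an unlabelled
natural language or in none at all.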

Process question: who defines the 'und' token?  Is it a meta-value defined in
the IETF RFC, or is it an invention of XSD Types or of XML?

Introducing a 'nil' compatible with the use thereof in XQuery would be
a suitable erratum if this is not already allowed.

Introducing the suggested sense for the null string would appear to be a bad
idea, on the grounds that the sense bound to this sign is ill-posed and not
interoperable.  So don't go there.

Al



>Disclosure: I personally favor issuing an erratum.
>
>Please send public comments on the question "erratum vs. XML 1.1" to
>xml-editor@w3.org, which is also copied on this mail.
>W3C-confidential comments may be sent to w3c-xml-core-wg@w3.org, which
>is also copied on this mail.
>
>-- 
>John Cowan                              <jcowan@reutershealth.com>
>http://www.ccil.org/~cowan              http://www.reutershealth.com
>Unified Gaelic in Cyrillic script!
>        http://groups.yahoo.com/group/Celticonlang

Received on Friday, 2 August 2002 09:30:00 UTC