Re: XML Core WG needs input on xml:lang="" from Chris Lilley on 2002-08-02 (xml-editor@w3.org from July to September 2002)

From: Chris Lilley <chris@w3.org>
Date: Fri, 2 Aug 2002 16:43:24 +0200
To: w3c-xml-plenary@w3.org, John Cowan <jcowan@reutershealth.com>
CC: w3c-i18n-ig@w3.org, xml-editor@w3.org, w3c-xml-core-wg@w3.org
Message-ID: <29198593.20020802164324@w3.org>

On Friday, August 2, 2002, 2:46:57 PM, John wrote:


JC> The W3C XML Core WG has decided to allow the value of xml:lang, the
JC> attribute for indicating the natural language of character data, to
JC> be an empty string in order to allow the explicit expression of
JC> language-less text inside language-marked text.  Here's an example:

JC> <p lang="en">
JC>   Here is an example of some C code:
JC>   <pre xml:lang="">
JC>      #include "stdio.h"
JC>      main() {printf("Hello world!");}
JC>   </pre>
JC> </p>

JC> By the present rules, there is no way to express the fact that the
JC> content of the pre element is not in English.  (Computer languages are out
JC> of scope for RFC 3066 and have no codes.)

This is a compelling example: authoring tools could use it, for
instance, to disable spell-checking on selected subtrees of the
document.

JC> However, the WG is divided on the question of whether to issue an
JC> erratum to XML 1.0 or to make this provision part of XML 1.1.

JC> Argument for XML 1.1:  It is a new feature and as such belongs in XML 1.1,
JC> which we are conveniently issuing shortly anyway.

It's not clear that it is a new feature.

JC> Argument for erratum:  It is just a single new allowed value for an attribute
JC> that already got a whole lot of new values when we upgraded (by existing
JC> erratum E11) from the obsolete RFC 1766 to the current RFC 3066.
JC> For example, "haw" was an illegal tag under 1766, but refers to the
JC> Hawai'ian language now.

That is a strong argument.

JC> Note: The XML Schema Datatypes document still references the obsolete RFC,
JC> but defers to XML 1.0 2e for the exact rules, so an erratum would immediately
JC> allow the empty string in objects of type xsd:language; an XML 1.1
JC> change would not immediately allow it.

I would consider that a benefit of the erratum route.

JC> Note: Any application that processes xml:lang has to already be prepared
JC> for thousands of legal values, most of which it will not understand.

'Understand' is a bit misleading. Many of the values might not trigger
any special action, true. It will 'understand' them in the sense that
an editor that has British English, American English, Canadian French
and French French dictionaries 'understands' what to do when
spell-checking a subtree with xml:lang="ja-jp", and hopefully
understands what to do, or at least what choices to present when
prompting the user, if it comes across xml:lang="fr" or
xml:lang="fr-iw".

JC> For example, de-jp is legal, symbolizing the variety of German spoken and
JC> written in Japan, whatever that might be.

Yes (I believe it is closely related to pers.martin-duerst) but the
crucial point here is that the processor does not have to know
anything about the social demographics of the German-speaking
Japanese population and, in fact, does not have to know that de means
German. It has to know that de-jp is a subtype of de and thus, if it
is a server with a resource of language de-jp and a request comes in
with Accept-Language: de, then it is an acceptable resource, whereas
Accept-Language: ja will generate a 'none acceptable' HTTP response.

In other words, processing consists of string matching on a
hierarchical set of hyphen-separated tokens, with zero understanding
involved.
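
A minimal sketch of that string matching, assuming the simple prefix
rule described above (a range matches a tag that equals it, or that
begins with it followed by a hyphen). The function name `matches` is
hypothetical and is not taken from any specification or library:

```python
def matches(language_range: str, language_tag: str) -> bool:
    """Return True if language_tag falls under language_range.

    Pure string comparison on hyphen-separated tokens, case-insensitive:
    the range matches a tag that equals it, or that starts with it
    followed by a hyphen. No knowledge of what the codes mean is used.
    """
    rng = language_range.lower()
    tag = language_tag.lower()
    return tag == rng or tag.startswith(rng + "-")


# A resource tagged de-jp satisfies a request for de, but not one for ja.
print(matches("de", "de-jp"))   # True
print(matches("ja", "de-jp"))   # False
print(matches("de", "den"))     # False: "den" is not a subtype of "de"
```

The trailing hyphen in the prefix test is what keeps "de" from
matching an unrelated tag that merely starts with the same letters.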

JC> Note: The existing code "und" is not synonymous with the proposed use of the
JC> empty string.  The "und" code means that the text is in some natural language,
JC> but we don't know which one; the empty string means that the text is not
JC> in a natural language.

Aha. The last part of your sentence means this is a rather different
proposal than I had thought.

A question. Is

<foo/>

thus equivalent to

<foo xml:lang="und"/>

and not equivalent to

<foo xml:lang=""/>

In other words, what is asserted by the absence of xml:lang on the
root element?  Is it an absence of information or is it some form of
positive assertion?  I would suggest that it is an absence of
information. Consider, for example, a program that pulls text from a
multilingual database, or accepts human input, and makes little XML
instances containing this text. The program does not know what the
language is, so it says nothing. This is not the same as the text
being in an unknown language.

Is "" appropriate for "undeclaring" a previously declared language?
Would "nal" or somesuch (by analogy with NaN for numbers) not be more
appropriate for non-human languages? You could then declare the value
of xml:lang to be "" or "xml:nal" or "an RFC 3066 code" and keep "" to
mean "undeclare" rather than "declare a specific thing". This would
also, I think, be more consistent with the use of "" in Namespaces in
XML 1.1.

JC> Disclosure: I personally favor issuing an erratum.

On balance, so do I, but I would like a little more clarity on the
semantics of "". The example that started your post was compelling but
perhaps misleading. I at first took it to mean that English was being
undeclared. Instead, it is saying that the contents are in a non-human
or non-natural language.
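
To make the distinctions concrete, here is a rough sketch of how a
processor might separate the cases under discussion. It assumes the
reading John describes ("" means not a natural language, "und" means a
natural language we cannot identify) plus my suggestion that a missing
attribute asserts nothing. The function name `describe_lang` and the
returned strings are hypothetical:

```python
from typing import Optional


def describe_lang(value: Optional[str]) -> str:
    """Classify an xml:lang value; None means no xml:lang is in scope."""
    if value is None:
        # Nothing declared anywhere: absence of information.
        return "no information"
    if value == "":
        # The proposed empty string: content is not in a natural language.
        return "not a natural language"
    if value.lower() == "und":
        # Existing code: some natural language, but undetermined.
        return "natural language, undetermined"
    return "natural language: " + value


for v in (None, "", "und", "en"):
    print(repr(v), "->", describe_lang(v))
```

Under the alternative I float above, the "" branch would instead mean
"undeclare", and a separate code would carry the non-natural-language
meaning.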

Let's consider this example and discuss what value of xml:lang is
suitable on the 'artefact' element:

<archeologicalReport>
 <abstract xml:lang="en">
  <para>During excavations, a stone was found with writings in a
  previously unknown language:
    <artefact>Zibble forg</artefact>
  </para>
 </abstract>
 <abstract xml:lang="fr">
  <para>Pendant des fouilles, une pierre a été trouvée avec
    des écritures dans une langue précédemment inconnue :
    <artefact>Zibble forg</artefact>
  </para>
 </abstract>
</archeologicalReport>

The text on the stone is in a human language but we don't know which
one. The example above erroneously (by inheritance) labels it as being
in English, and a second copy as being in French. So xml:lang needs to
be set on both 'artefact' elements.
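
The inheritance at work here can be sketched with Python's standard
xml.etree.ElementTree, assuming the usual rule that the nearest
ancestor-or-self xml:lang applies. The helper name `effective_lang` is
hypothetical, and the document is a shortened copy of the example
above:

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_lang(root, target):
    """Return the xml:lang value in scope for target, or None if no
    element from target up to the root declares one."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        if XML_LANG in node.attrib:
            return node.attrib[XML_LANG]
        node = parents.get(node)
    return None


doc = ET.fromstring(
    "<archeologicalReport>"
    '<abstract xml:lang="en"><para>text '
    "<artefact>Zibble forg</artefact></para></abstract>"
    '<abstract xml:lang="fr"><para>texte '
    "<artefact>Zibble forg</artefact></para></abstract>"
    "</archeologicalReport>"
)

artefacts = list(doc.iter("artefact"))
print([effective_lang(doc, a) for a in artefacts])  # ['en', 'fr']
print(effective_lang(doc, doc))                     # None
```

Returning None for the root, rather than "" or "und", keeps "nothing
was declared" distinct from either explicit value.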

Would "und" or "" be the appropriate choice here?

Second question, for the root element: it has no text content and two
children in different languages. Would "und" be appropriate here? It
doesn't seem like it, since the two languages of the content of the
element are both known. Is "" appropriate? It seems not, either.

JC> Please send public comments on the question "erratum vs. XML 1.1" to
JC> xml-editor@w3.org, which is also copied on this mail.
JC> W3C-confidential comments may be sent to w3c-xml-core-wg@w3.org, which
JC> is also copied on this mail.

These comments are public but were copied to xml core anyway for their
convenience.


-- 
 Chris                            mailto:chris@w3.org
Received on Friday, 2 August 2002 10:44:06 UTC