- From: Chris Lilley <chris@w3.org>
- Date: Fri, 2 Aug 2002 16:43:24 +0200
- To: w3c-xml-plenary@w3.org, John Cowan <jcowan@reutershealth.com>
- CC: w3c-i18n-ig@w3.org, xml-editor@w3.org, w3c-xml-core-wg@w3.org
On Friday, August 2, 2002, 2:46:57 PM, John wrote:

JC> The W3C XML Core WG has decided to allow the value of xml:lang, the
JC> attribute for indicating the natural language of character data, to
JC> be an empty string in order to allow the explicit expression of
JC> language-less text inside language-marked text. Here's an example:

JC> <p lang="en">
JC> Here is an example of some C code:
JC> <pre xml:lang="">
JC> #include "stdio.h"
JC> main() {printf("Hello world!"};}
JC> </pre>
JC> </p>

JC> By the present rules, there is no way to express the fact that the
JC> content of the pre element is not in English. (Computer languages are out
JC> of scope for RFC 3066 and have no codes.)

This is a compelling example (it could, for example, be used by
authoring tools to disable spell-checking on selected subtrees of the
document).

JC> However, the WG is divided on the question of whether to issue an
JC> erratum to XML 1.0 or to make this provision part of XML 1.1.

JC> Argument for XML 1.1: It is a new feature and as such belongs in XML 1.1,
JC> which we are conveniently issuing shortly anyway.

It's not clear that it is a new feature.

JC> Argument for erratum: It is just a single new allowed value for an attribute
JC> that already got a whole lot of new values when we upgraded (by existing
JC> erratum E11) from the obsolete RFC 1766 to the current RFC 3066.
JC> For example, "haw" was an illegal tag under 1766, but refers to the
JC> Hawai'ian language now.

That is a strong argument.

JC> Note: The XML Schema Datatypes document still references the obsolete RFC,
JC> but defers to XML 1.0 2e for the exact rules, so an erratum would immediately
JC> allow the empty string in objects of type xsd:language; an XML 1.1
JC> change would not immediately allow it.

I would consider that a benefit of the erratum route.

JC> Note: Any application that processes xml:lang has to already be prepared
JC> for thousands of legal values, most of which it will not understand.

'Understand' is a bit misleading. Many of the values might not trigger
any special action, true. But it will 'understand' them in the sense
that an editor that has British English, American English, Canadian
French and French French dictionaries 'understands' what to do when
spellchecking a subtree with xml:lang="ja-jp", and hopefully
understands what to do, or at least what choices to present when
prompting the user, if it comes across xml:lang="fr" or
xml:lang="fr-iw".

JC> For example, de-jp is legal, symbolizing the variety of German spoken and
JC> written in Japan, whatever that might be.

Yes (I believe it is closely related to pers.martin-duerst), but the
crucial point here is that the processor does not have to know
anything about the social demographics of the German-speaking Japanese
population, and in fact does not even have to know that de means
German.

It has to know that de-jp is a subtype of de. Thus, if it is a server
with a resource of language de-jp and a request comes in with

  Accept-Language: de

then it is an acceptable resource, whereas

  Accept-Language: ja

will generate a 'Not Acceptable' (406) HTTP response.

In other words, processing consists of string matching on a
hierarchical set of hyphen-separated tokens, with zero understanding
involved.

JC> Note: The existing code "und" is not synonymous with the proposed use of the
JC> empty string. The "und" code means that the text is in some natural language,
JC> but we don't know which one; the empty string means that the text is not
JC> in a natural language.

Aha. The last part of your sentence means this is a rather different
proposal than I had thought.

A question. Is

  <foo/>

thus equivalent to

  <foo xml:lang="und"/>

and not equivalent to

  <foo xml:lang=""/>

? In other words, what is asserted by the absence of xml:lang on the
root element? Is it an absence of information, or is it some form of
positive assertion?

I would suggest that it is an absence of information.
For example, consider a program that pulls text from a multilingual
database, or accepts human input, and makes little XML instances
containing this text. The program does not know what the language is,
so it says nothing. This is not the same as the text being in an
unknown language.

Is "" appropriate for "undeclaring" a previously declared language?
Would "nal" or somesuch (by analogy with NaN for numbers) not be more
appropriate for non-human languages? You could then declare the value
of xml:lang to be "" or "xml:nal" or "an RFC 3066 code" and keep "" to
mean "undeclare" rather than "declare a specific thing". This would
also, I think, be more consistent with the XML Namespaces 1.1 use of
"".

JC> Disclosure: I personally favor issuing an erratum.

On balance, so do I, but I would like a little more clarity on the
semantics of "". The example that started your post was compelling but
perhaps misleading. I at first took it to mean that English was being
undeclared. Instead, it is saying that the contents are in a non-human
or non-natural language.

Let's consider this example and discuss what value of xml:lang is
suitable on the 'artefact' element:

<archeologicalReport>
  <abstract xml:lang="en">
    <para>During excavations, a stone was found with writings in a
      previously unknown language: <artefact>Zibble forg</artefact>
    </para>
  </abstract>
  <abstract xml:lang="fr">
    <para>Pendant des fouilles, une pierre a été trouvée avec des
      écritures dans une langue précédemment inconnue :
      <artefact>Zibble forg</artefact>
    </para>
  </abstract>
</archeologicalReport>

The text on the stone is in a human language but we don't know which
one. The example above erroneously (by inheritance) labels it as being
in English, and a second copy as being in French. So xml:lang needs to
be set on both 'artefact' elements. Would "und" or "" be the
appropriate choice here?

Second question, for the root element - it has no text content and two
children in different languages. Would "und" be appropriate here?
It doesn't seem like it - the two languages of the content of the
element are both known. Is "" appropriate? It seems not, either.

JC> Please send public comments on the question "erratum vs. XML 1.1" to
JC> xml-editor@w3.org, which is also copied on this mail.
JC> W3C-confidential comments may be sent to w3c-xml-core-wg@w3.org, which
JC> is also copied on this mail.

These comments are public but were copied to xml core anyway for their
convenience.

--
Chris mailto:chris@w3.org
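
As a postscript to the matching point above, here is a minimal sketch
of "string matching on a hierarchical set of hyphen-separated tokens".
It assumes the simple prefix rule for Accept-Language ranges (a range
matches a tag if it equals the tag, or equals a prefix of it that is
followed by "-") and ignores quality values entirely. The function
names lang_matches and select_resource are hypothetical, chosen only
for illustration.

```python
from typing import Optional, Sequence


def lang_matches(language_range: str, tag: str) -> bool:
    """True if `tag` equals the range or is a hyphen-delimited subtype of it.

    Pure string comparison: nothing here knows that "de" means German.
    """
    r = language_range.lower()
    t = tag.lower()
    if r == "*":  # wildcard range accepts any tag
        return True
    return t == r or t.startswith(r + "-")


def select_resource(accept_language: Sequence[str],
                    available: Sequence[str]) -> Optional[str]:
    """Return the first available tag matched by any requested range.

    Returns None when nothing matches; a server would then answer with
    406 Not Acceptable. Quality values (q=...) are deliberately ignored
    in this sketch.
    """
    for language_range in accept_language:
        for tag in available:
            if lang_matches(language_range, tag):
                return tag
    return None


if __name__ == "__main__":
    print(select_resource(["de"], ["de-jp"]))  # a de request accepts de-jp
    print(select_resource(["ja"], ["de-jp"]))  # a ja request does not
```

Note that the comparison stops at token boundaries, so a range of "de"
does not match an unrelated tag that merely begins with the same
letters.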
Received on Friday, 2 August 2002 10:44:06 UTC