
RE: [Ltru] RE: For review: Tagging text with no language

From: Peter Constable <petercon@microsoft.com>
Date: Thu, 12 Apr 2007 22:41:45 -0700
To: Stephen Deach <sdeach@adobe.com>, Mark Davis <mark.davis@icu-project.org>
CC: Kent Karlsson <kent.karlsson14@comhem.se>, Asmus Freytag <asmusf@ix.netcom.com>, John Cowan <cowan@ccil.org>, Richard Ishida <ishida@w3.org>, LTRU Working Group <ltru@ietf.org>, "www-international@w3.org" <www-international@w3.org>, CLDR list <cldr@unicode.org>
Message-ID: <DDB6DE6E9D27DD478AE6D1BBBB8357955D038B87DD@NA-EXMSG-C117.redmond.corp.microsoft.com>
ISO 639 indicates that programming languages are out of scope. I interpret that to mean that no programming language or group of programming languages is positively represented - i.e. none of the entities represented (either individually or as part of a group) is a programming language. I interpret "zxx" to mean "the content so tagged is not any instance of the kind of entities encompassed by this coding standard". That would entail that "zxx" could appropriately be applied to content that is in a programming language with as much appropriateness as applying it to a part number or random text or an empty file or telemetry from a space probe: you may or may not be able to interpret the content, but you certainly cannot interpret it in terms of any human language.


Peter

________________________________
From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Stephen Deach
Sent: Thursday, April 12, 2007 8:23 PM
To: Mark Davis; Stephen Deach
Cc: Kent Karlsson; Asmus Freytag; John Cowan; Richard Ishida; LTRU Working Group; www-international@w3.org; CLDR list
Subject: Re: [Ltru] RE: For review: Tagging text with no language

My point was intended to be that this started out as a debate on the interpretation of "". I don't see that any existing tag other than "und" makes any sense as an alternate interpretation of "", so why the long debate and the discussion of the various re-translations of what "und" means in Danish, Swedish, German, Italian, et al.? (At the nuance level of this discussion, such translations are ALWAYS imperfect.)
  I have no objection to your interpretations of what happens if someone explicitly says "mis", "mul", etc.
  I would note that "und" makes no statement of whether the content is in a single language or several, only that any/all language information is "not defined" for the specified scope.

If the user says nothing or says xml:lang="", it can't mean anything other than "undefined" (for which the tag is "und") or "unspecified" (for which there is no tag; regardless of whether the user intent was "I don't know" vs "I don't care" vs "I don't want to guess" vs "I don't want to say"; and regardless of whether it is all a single language or a mixture of several languages). The tag "und" means that no input regarding the language is (languages are) provided. Why not say that no xml:lang specifier and xml:lang="" are both interpreted as xml:lang="und" and be done with it?
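
To make the proposal concrete, here is a minimal sketch in Python of the rule suggested above: a missing xml:lang at the top of the document, or an explicit xml:lang="" anywhere, is reported as "und". The function name effective_langs and the sample document are hypothetical, invented for illustration; the sketch also assumes the usual XML inheritance of xml:lang from parent to child.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predeclared XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def effective_langs(element, inherited="und"):
    """Yield (tag, language) for element and all of its descendants.

    Assumption of this sketch: no xml:lang at all and xml:lang="" are
    both reported as "und", as proposed in the message above.
    """
    declared = element.get(XML_LANG)
    if declared is None:
        lang = inherited          # nothing said here: inherit from the parent
    elif declared.strip() == "":
        lang = "und"              # explicit empty value: treated as "und"
    else:
        lang = declared
    yield element.tag, lang
    for child in element:
        yield from effective_langs(child, lang)


sample = ET.fromstring(
    '<doc>'
    '<p xml:lang="sv">text</p>'
    '<p xml:lang="sv"><code xml:lang="">x</code></p>'
    '<p>text</p>'
    '</doc>'
)
result = list(effective_langs(sample))
```

With this sample, the root and the last paragraph come out as "und" because nothing was declared, and the code element comes out as "und" because its empty value overrides the inherited "sv".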

It is quite clear that "zxx" means "I know this is not linguistic content"; that "art" means it is "an artificial language (invented or other non-natural language)"; that "mul" means "there is a mixture of languages which I may or may not choose to identify at a lower level in the document"; and that "mis" means it is "a language but I have no better identifier for it".

One can have a separate debate over whether "zxx" or "art" should be used for computer-programming languages, or whether computer-programming languages (as a group or individually) deserve their own tag(s); but that is not an "Internationalization" issue.


At 2007.04.12-18:15(-0700), Mark Davis wrote:

I think I agree with you in spirit, but not in precise details. The tag "und" means "undetermined", so when I encounter it I don't know whether the content contains one language, many languages, or no language. The tag "zxx" would mean that there is no language content, "mis" would mean that there is at least some language content, and "mul" would mean that there is language content, with more than one language.
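
The distinctions in the paragraph above can be written down as a small lookup table. This is only a sketch of that reading, in Python; the names Assumption, READER_ASSUMPTIONS, and what_reader_can_assume are hypothetical, and None is used to mean "the reader cannot tell".

```python
from typing import NamedTuple, Optional


class Assumption(NamedTuple):
    has_language: Optional[bool]        # None means "cannot tell"
    multiple_languages: Optional[bool]  # None means "cannot tell"


# What a reader may assume if the tag was applied correctly.
READER_ASSUMPTIONS = {
    "und": Assumption(None, None),    # one language, many, or none: unknown
    "zxx": Assumption(False, False),  # no language content
    "mis": Assumption(True, None),    # at least some language content
    "mul": Assumption(True, True),    # language content, more than one language
}


def what_reader_can_assume(tag):
    """Return the Assumption for one of the four special codes, else None."""
    return READER_ASSUMPTIONS.get(tag.strip().lower())
```

For example, what_reader_can_assume("und") returns an Assumption with both fields set to None, which is the "essentially nothing" case.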

I think that trying to consider what the motivations of the tagger are may lead to misleading impressions. Assume for the moment that the tag is correct. From the perspective of the tagger, using "und" could mean, as you say, that the tagger doesn't know or care (or want to communicate, or want to spend the time to determine) what the language is or whether there is any language content there at all. There could be quite a variety of motivations for the tagger's using "und"; the key is what the reader of "und" can assume about the content, which is essentially nothing. With "mis", the situation is similar, but slightly narrower. The tagger may still not know much, or care much, but maybe cared enough to determine that there was something there, or maybe there was language content there, but there is no language code that correctly matches it (Proto-Germanic, perhaps).

Similarly, using "may not" language is a bit too strong in your phrase "Whereas zxx says I 'may not' apply any of those language-based services because it is not a 'natural' language". Having content tagged with "zxx" doesn't restrict me from doing anything I want to; it just means if it was tagged correctly, it does not contain any language content. (I might decide that the tagger was mistaken -- when we at Google look at the tagging people actually do of web content, there is a fairly high percentage of both invalid tags and valid-but-incorrect tags.)

Mark

On 4/12/07, Stephen Deach <sdeach@adobe.com> wrote:
I think much of this discussion is dealing with terminology differences that are so narrow that one is discussing "the number of angels who can dance on the head of a pin". (In other words we are debating theology, not practice.) In reality, specifications are worded as carefully as possible, but interpretation is open to the reader's most common definition/redefinition/translation of the exact terminology.  -- So rather than debate what the "exact meaning" of a word/phrase is in each of these languages, maybe we should take a looser interpretation of what is written and then clarify the intent.
My reading of the ISO spec is that "und/undetermined" means "I don't know (or care, or am unwilling to state) what the language is (and have no closer alternative language identifier given the available options)". From a practical viewpoint, "und" indicates I can't assume any specific/preferred linguistic definitions for words in the content, nor can I assume any specific/preferred pronunciation-, spelling-, hyphenation-, and/or grammar-rules on the content; though I am allowed to attempt my own linguistic analysis to guess at the language. (Whereas zxx says I 'may not' apply any of those language-based services because it is not a 'natural' language and should not attempt any linguistic analysis to guess at the language.) I can't see any practical difference between "und" and "" (except that "" is disallowed in some processing environments), so why can't the documents simply say that a missing specification or xml:lang="" (should either occur) will be interpreted as "und"?
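
As a rough illustration of the practical reading in this message, the following Python sketch picks which language-based services a processor would attempt for a given tag. The function name plan_language_services and the three service names are hypothetical. Treating "mis" and "mul" like "und" is an assumption of the sketch, and the "zxx" branch is a default a processor might choose, not a prohibition.

```python
def plan_language_services(tag):
    """Return which language-based services to attempt for a language tag.

    A missing or empty tag is treated as "und", per the proposal above.
    """
    tag = (tag or "").strip().lower() or "und"
    if tag == "zxx":
        # Tagged as having no linguistic content: skip everything by default.
        return {"detect_language": False, "spellcheck": False, "hyphenate": False}
    if tag in {"und", "mis", "mul"}:
        # No specific rule set can be assumed, but guessing is still allowed.
        # (Grouping "mis" and "mul" with "und" is an assumption of this sketch.)
        return {"detect_language": True, "spellcheck": False, "hyphenate": False}
    # A specific language was stated: apply its rules, no guessing needed.
    return {"detect_language": False, "spellcheck": True, "hyphenate": True}
```

Calling plan_language_services("") gives the same plan as plan_language_services("und"), which is the equivalence argued for above.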

It has been a while since I considered myself fluent in Swedish (and I intentionally ignored the lack of the dieresis in the original text as an indication that the translations were "lossy"). I just thought that some comment would force the necessary clarification of the translations.


At 2007.04.13-01:26(+0200), Kent Karlsson wrote:

Stephen Deach wrote:
> sv.xml:                       <language type="und">obestämt språk</language>

> I thought "obestamt" was "unstated".
"Obestämt" literally means "undetermined". "Unstated" would be "osagt", "outtalat", or "ej angett"
("not given", closer to the current German translation).

Though I would agree that xml:lang="" is closer to "unstated" than "undetermined". I'm not sure
that that nit-picking leads anywhere in this case. But "unstated" is not the same as "undetermined";
it may well be determined, but just not stated... So maybe there is a difference worth bothering about.

        /kent k


---Steve Deach
   sdeach@adobe.com



--
Mark

Received on Friday, 13 April 2007 05:41:53 GMT
