RE: [Ltru] Re: For review: Tagging text with no language

From: Karen_Broome@spe.sony.com [mailto:Karen_Broome@spe.sony.com]

> Let's look at a real use case. Would you say this page
> is in "en" and "zxx" and that the sections of code
> have no linguistic value even though they are clearly
> intended to be read by humans and not machines? Or does
> context matter?

You're asking to have one tag that covers an entire document even though that document is mixed. What if I insert a quotation in Spanish in this mail? ¿Qué vamos a hacer? It's no different. If you must use a single tag for the whole thing, then clearly this page is predominantly in English. I don't know what you'd do if it were closer to 50-50.


> Interpreting "no linguistic content" as "not a human
> language, could be a programming language" could cause
> some problems. There may be a use case for programming
> languages to have their own tag if this is deemed
> appropriate for the 639 standards or IANA registry, and
> these languages are different than say, instrumental
> music in the Library of Congress or a sound effects
> track in a film (both zxx, I'd say).

Your argument is akin to saying that someone may want to code audio in Unicode. ISO 639 has a defined scope: human languages. Programming languages, electrical schematics, dance notation, bridge-hand notation, math formulas and engineering drawings are all graphic content that can be interpreted by humans. Some of these can be represented in text, but that does not change the fact that they are not the kind of thing coded by ISO 639, namely human languages.

> I think programming languages have specific
> identification and parsing needs and as such need to
> be treated differently.

As I suggested earlier, the scope defined by ISO 639 does not force RFC 4646bis to be limited to the same scope -- in fact, it cannot be. ("Language tags" already code things other than linguistic variety, written form in particular.) So, if you want to propose variant subtags to differentiate programming code from music notation, then I don't see why that couldn't be done.

But it would be out of scope for ISO 639 to code such a distinction, and it would be a non-conforming re-interpretation to say that zxx does not apply to programming languages.
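To illustrate the point that language tags already carry information about written form, here is a rough Python sketch that pulls the script subtag out of tags such as sr-Latn and sr-Cyrl. The function name script_subtag is made up for this example, and the parsing is deliberately simplified: it only looks at the subtag immediately after the primary language, so it does not handle extended language subtags, private-use sequences, or grandfathered tags.

```python
def script_subtag(tag):
    """Return the script subtag of a simple language tag, or None.

    Simplified assumption: a four-letter alphabetic subtag directly
    after the primary language subtag is treated as the script.
    """
    subtags = tag.split("-")
    if len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha():
        # Script subtags are conventionally written in title case.
        return subtags[1].title()
    return None


for tag in ("sr-Latn", "sr-Cyrl", "en-US", "zxx"):
    print(tag, script_subtag(tag))
```

The same tag structure is what would make a variant subtag under zxx possible in principle, though no such subtag is shown here because none has been proposed.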


> The code in the article above should be rendered in
> Braille, for example, so it must be parsed. This makes
> it different from non-linguistic content.

You're confusing the language of content with the representation mode in some communicative technology. English content in Braille is still English, and so clearly different from zxx. That is not in any way comparable to discussing code in a programming language.


> How would you classify the page I cite?

As I mentioned above, this question is no different from asking how to come up with one tag for a page that contains content in both English and Spanish. On a *practical* level, I would tag that article as en and ignore the fact that it contains XML code snippets; but if someone were being careful to tag elements within the document correctly, then the code snippets should be tagged zxx. (That is, unless you want to register variant subtags to differentiate between different kinds of non-linguistic content.)
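The two levels of tagging can be sketched in Python as follows. This is only an illustration under my own assumptions: the segment list, the function name document_tag, and the choice to weight each tag by character count are all hypothetical, and nothing in ISO 639 or RFC 4646bis prescribes how to pick a single tag for a mixed document.

```python
from collections import Counter

# Hypothetical segments of a mixed article: (language tag, text).
# Each segment carries its own tag; code snippets are tagged zxx.
segments = [
    ("en", "The following element declares the language of its content."),
    ("zxx", '<p xml:lang="en">Hello</p>'),
    ("en", "Nested elements inherit the value unless they override it."),
]


def document_tag(segments):
    """Pick one tag for the whole document: the tag covering the most text.

    Assumed heuristic; it expects at least one segment, and a tie is
    resolved in favor of the tag that appears first.
    """
    weights = Counter()
    for tag, text in segments:
        weights[tag] += len(text)
    return weights.most_common(1)[0][0]


print(document_tag(segments))
```

A heuristic like this gives a clear answer when one language predominates, but it says nothing useful about the near 50-50 case, which remains a judgment call.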


Peter

Received on Tuesday, 17 April 2007 00:39:08 UTC