RE: I18N issues an OWL2 from Phillips, Addison on 2008-07-14 (public-owl-wg@w3.org from July 2008)

From: Phillips, Addison <addison@amazon.com>
Date: Mon, 14 Jul 2008 08:49:19 -0700
To: Felix Sasaki <fsasaki@w3.org>, Axel Polleres <axel.polleres@deri.org>
CC: Ivan Herman <ivan@w3.org>, Jie Bao <baojie@cs.rpi.edu>, "public-owl-wg@w3.org" <public-owl-wg@w3.org>, "public-i18n-core-comments@w3.org" <public-i18n-core@w3.org>, "public-rif-comments@w3.org" <public-rif-comments@w3.org>, Boris Motik <boris.motik@comlab.ox.ac.uk>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA013BD2AF9A@EX-SEA5-D.ant.amazon.com>

> >
> > Felix Sasaki wrote:
>
> That is, for the language tags "art-lojban" and "jbo" there is no
> hierarchy. The language tags express the same language.

However: it is always permitted to treat these two tags as being different. BCP 47 defines two classes of conformance (well-formed and validating). Validating language tag processors may canonicalize deprecated tags into their preferred form before matching, but this isn't required. Most specifications are "well-formed" and not "validating" anyway: none of the matching schemes in BCP 47 are validating (some aren't even "well-formed").

>
> That is, Mandarin Chinese could be tagged as "zh-cmn" or "cmn" or
> "zh".
> Again you have no clear "length to hierarchy" relation.

This isn't quite accurate. "zh" doesn't mean "Mandarin Chinese"; it only means Chinese. That doesn't mean content in Mandarin can't be tagged with it (quite the contrary), but that fact doesn't necessarily enter into tag processing.

The problem here is that language tag matching/selection/etc. is really quite simplistic. It is based solely on (case-insensitive) string matching, with no recourse to the registry, the ABNF, or anything else. If you know the tag's structure, you can use it to your advantage, since the tag's structure was designed to help you do various operations in a useful manner (such as selecting all tags with a given script or a given region).
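To illustrate what "using the tag's structure" can look like, here is a rough sketch in Python. It assumes the tag is already well-formed per BCP 47 and relies only on subtag shape (four letters for a script, two letters or three digits for a region); it consults no registry. The function name script_and_region is hypothetical, purely for illustration.

```python
import re
from typing import Optional, Tuple


def script_and_region(tag: str) -> Tuple[Optional[str], Optional[str]]:
    """Pull the script and region subtags out of a well-formed tag.

    Illustrative only: relies on subtag length and character class,
    does no validation, and consults no registry.
    """
    script = None
    region = None
    # Skip the primary language subtag; inspect the rest in order.
    for subtag in tag.split("-")[1:]:
        if len(subtag) == 1:
            # A singleton starts an extension or private-use sequence.
            break
        if script is None and region is None and re.fullmatch(r"[A-Za-z]{4}", subtag):
            script = subtag.title()   # conventional casing, e.g. "Hant"
        elif region is None and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", subtag):
            region = subtag.upper()   # conventional casing, e.g. "TW"
    return script, region


# Selecting all tags with a given region from a list:
tags = ["de-AT", "de-AT-1901", "de-1901", "fr-FR"]
austrian = [t for t in tags if script_and_region(t)[1] == "AT"]
```

Nothing here needs the registry: the subtag positions and shapes alone carry enough information for this kind of selection.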

However, we often struggle with it in XML contexts because attribute matching is usually Boolean all-or-nothing matching rather than looking for the trailing hyphen. OWL has a predilection for this kind of matching, since, by design, it is about "is-a" relationships that exactly match strings.

This suggests to me that language tags would most easily be implemented by using Basic Filtering from RFC 4647. That is, the language tag in an "internationalizedString" matches a given request iff the request is an exact prefix (where the next character is a hyphen or end of string) of the value.

That is:

  LanguageAssertion { hasLanguage "de" }

Would match:

  "gruß dich"^^lang:de
  "gruß dich"^^lang:de-AT
  "gruß dich"^^lang:de-1901
  "gruß dich"^^lang:de-AT-1901
  ... usw...

But not:

  "whoa"^^lang:del  # a string in the Delaware language; "whoa" probably isn't actually a word
  "bonjour"^^lang:fr
  ... etc...
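The matching rule above can be sketched in a few lines of Python. This is a minimal sketch of Basic Filtering as described in RFC 4647, not anything OWL-specific; the function name basic_filter_match is made up for illustration, and the handling of the wildcard range "*" follows the RFC rather than anything in the examples above.

```python
def basic_filter_match(language_range: str, tag: str) -> bool:
    """Return True if `tag` matches `language_range` under Basic Filtering.

    The comparison is case-insensitive. The range matches when it equals
    the tag, or when it is a prefix of the tag and the next character in
    the tag is a hyphen.
    """
    rng = language_range.lower()
    value = tag.lower()
    if rng == "*":
        # The wildcard range matches any tag.
        return True
    return value == rng or value.startswith(rng + "-")


candidates = ["de", "de-AT", "de-1901", "de-AT-1901", "del", "fr"]
matches = [t for t in candidates if basic_filter_match("de", t)]
```

Appending the hyphen before the prefix test is what keeps "del" from matching "de".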

This addresses the "macrolanguage problem". People would have to know to construct certain kinds of queries to work around the multiplicity of tag formats. But I don't think that is OWL's problem particularly.

Hope this helps.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

Received on Monday, 14 July 2008 15:50:02 UTC