comments on rdf:text draft from Phillips, Addison on 2009-04-07 (public-i18n-core@w3.org from April to June 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 7 Apr 2009 07:23:46 -0700
To: "public-rdf-text@w3.org" <public-rdf-text@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA019F34E720@EX-SEA5-D.ant.amazon.com>

Hi,

My comments follow. These are personal comments.

1. The start of the intro is kind of unclear. It isn't really "internationalized text", but rather natural language (human language) text that we're dealing with. Where it says:

--
Internationalized text — that is, text that additionally conveys information in terms of a language tag — is used in several existing W3C specifications, such as RDF, XML, OWL, and RIF.
--

I would suggest instead something like:

--
The storage and transmission of human (natural) language text is one important use of character data in most document formats. The natural language of the text needs to be identified for proper display, selection, or processing and this identification is usually conveyed via a "language tag".
--

2. The second sentence uses the word "language" in a confusing way, since it is used for both "natural language" and for "data format language" (i.e. RIF, RDF, OWL, etc.). The use of the term language should be kept unambiguous.

3. The intro to section 2 is still not quite right. Instead of the first paragraph, I think it suffices to say:

--
A 'character' is an atomic unit of text, as defined in [Unicode] and/or [ISO/IEC 10646] and corresponding to the 'Char' production from [XML].
--

4. The sentence "Code points are written as U+ followed by the hexadecimal value of the code point" is not quite right. You might moderate this by saying "are represented by U+ (etc.) in this document", although you barely use the U+ syntax in the document. Note that the sentence is also incomplete: the usual minimum length of a U+ hex sequence is four hex digits (U+00E9).
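
To illustrate the four-digit minimum mentioned in point 4, here is a minimal sketch in Python. The function name format_code_point is made up for this example and does not come from the draft or from any specification.

```python
def format_code_point(cp: int) -> str:
    """Render a code point in U+ notation, zero-padded to at least four hex digits.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    # 04X pads short values with leading zeros but never truncates long ones.
    return f"U+{cp:04X}"


print(format_code_point(0xE9))     # U+00E9
print(format_code_point(0x1F600))  # U+1F600
```

Values above U+FFFF simply use as many digits as they need (five or six); only the lower bound of four is fixed.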

5. Quoting the number of available code points in Unicode seems slightly absurd to me. I see the assertion about cardinality, but I'm not sure what use it has.

I think your count is also wrong: no code point ending in 'FFFE' or 'FFFF' is a valid code point (so 1FFFE, 2FFFF, 3FFFE, and AFFFF are all invalid too). The actual count is 1112029 (or 0x10F7DD).

Note that using the hex here is probably a good idea. Code points in Unicode run from 0 to 0x10FFFF, of which a number are permanently reserved (invalid as Char).
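
One way to arrive at the figure 1112029 is sketched below in Python. The message does not show its arithmetic, so the breakdown is an assumption: start from the full range U+0000 to U+10FFFF, drop the surrogates, drop the code points ending in FFFE or FFFF in every plane, and additionally drop U+0000 (which the XML 'Char' production does not allow). All names in the sketch are illustrative.

```python
# Full code point range: U+0000 through U+10FFFF inclusive.
TOTAL_CODE_POINTS = 0x10FFFF + 1


def is_surrogate(cp: int) -> bool:
    # Surrogate code points are reserved for UTF-16 and never stand alone.
    return 0xD800 <= cp <= 0xDFFF


def is_plane_end_noncharacter(cp: int) -> bool:
    # The last two code points of each of the 17 planes (xxFFFE, xxFFFF).
    return (cp & 0xFFFF) in (0xFFFE, 0xFFFF)


surrogates = sum(1 for cp in range(TOTAL_CODE_POINTS) if is_surrogate(cp))
plane_end = sum(1 for cp in range(TOTAL_CODE_POINTS) if is_plane_end_noncharacter(cp))

remaining = TOTAL_CODE_POINTS - surrogates - plane_end
# Assumption: U+0000 is excluded as well, which yields the quoted figure.
remaining_without_nul = remaining - 1

print(TOTAL_CODE_POINTS, surrogates, plane_end)
print(remaining_without_nul, hex(remaining_without_nul))
```

Under these assumptions the total is 1114112 (0x110000), with 2048 surrogates and 34 plane-end code points removed. The sketch deliberately ignores any other reserved ranges, since the comment above only names the FFFE/FFFF cases.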

6. You refer to language tags being registered in this text:

--
Furthermore, note that this definition does not assume a language tag to be registered with the IANA Language Subtag Registry as per BCP 47 [BCP 47]. An rdf:text implementation MAY choose to reject unregistered language tags.
--

... but in fact it is *subtags* that may (or may not) be registered. You would probably be better off referring to BCP 47's conformance criteria by name here. Perhaps:

--
Furthermore, note that this definition corresponds to the 'well-formed' rather than the 'valid' class of conformance in [BCP 47]. A language tag MAY contain subtags that are not registered in the IANA Language Subtag Registry, although an rdf:text implementation MAY also choose to reject such invalid language tags.
--
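
The difference between 'well-formed' and 'valid' can be sketched roughly as follows, in Python. This is a loose approximation and not the BCP 47 grammar: the structural check only requires hyphen-separated subtags of 1 to 8 ASCII alphanumerics, and REGISTRY_EXCERPT is a made-up three-entry stand-in for the IANA Language Subtag Registry. Both function names are hypothetical.

```python
import re

# Hypothetical, tiny stand-in for the IANA Language Subtag Registry.
REGISTRY_EXCERPT = {"en", "tlh", "aq"}

# Assumption: a subtag is 1-8 ASCII letters or digits. The real ABNF in
# BCP 47 is stricter about which subtag may appear in which position.
SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")


def looks_well_formed(tag: str) -> bool:
    """Loose structural check only; no registry lookup."""
    return all(SUBTAG.fullmatch(subtag) for subtag in tag.split("-"))


def unregistered_subtags(tag: str) -> list[str]:
    """Subtags missing from the excerpt (compared case-insensitively)."""
    return [s for s in tag.split("-") if s.lower() not in REGISTRY_EXCERPT]


for tag in ("tlh-AQ", "en-fubar", "en--US"):
    print(tag, looks_well_formed(tag), unregistered_subtags(tag))
```

In this sketch "en-fubar" passes the structural check but has an unregistered subtag ("fubar"), which is the case an rdf:text implementation MAY choose to reject; "tlh-AQ" passes both checks against the excerpt.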

7. Please lose the 'foo-bar' example. It is invalid in more than one way. There are many ways to express the same idea using real subtags (tlh-AQ Klingon-as-used-in-Antarctica) or an unregistered subtag (en-fubar).

8. It is ironic that "lc-langTag" itself is not lowercase.

I may have some more comments later today.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

Received on Tuesday, 7 April 2009 14:24:25 UTC