RE: comments on rdf:text draft from Phillips, Addison on 2009-04-07 (public-i18n-core@w3.org from April to June 2009)

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 7 Apr 2009 13:38:42 -0700
To: Boris Motik <boris.motik@comlab.ox.ac.uk>, "public-rdf-text@w3.org" <public-rdf-text@w3.org>
CC: "public-i18n-core@w3.org" <public-i18n-core@w3.org>
Message-ID: <4D25F22093241741BC1D0EEBC2DBB1DA019F34ED1E@EX-SEA5-D.ant.amazon.com>
> 
> Thanks for this comment. I'm afraid, however, that in response to Sandro's
> comments, I have rewritten earlier today this part of the introduction. I've
> adopted the "elevator pitch" that Sandro suggested. Please let me know should
> you consider that the current intro needs further revision.
> 

The new text is okay, although I think it might leave the average reader slightly mystified about what rdf:text is for. There is a lot of text about different literal flavors, but no mention of why the presence or absence of a language tag is interesting. And it concludes with this paragraph, which suggests some confusion about how to represent text in RDF:

--
RDF tools may use other mechanisms for representing internationalized text, such as the xml:lang feature of the rdf:XMLLiteral datatype. The rdf:text datatype does not provide a replacement for such mechanisms.
--

It seems to me that the introduction should say why these three classes of literals are related and why rdf:text might be interesting. I would at least include some sort of notation about why language tags might be needed. Perhaps add a third bullet point:

--
* Literals often contain human-readable natural language text. RDF needs a mechanism for representing literals in various different languages, for selecting the proper literal in a specific language, and for allowing applications to keep language information with literals to facilitate language-sensitive processing.
--

Minor notes: first bullet s/literals/literal/
Also: "internationalized text" is a misnomer. Perhaps "text in different languages"??

> 
> > 3. The intro to section 2 is still not quite right. Instead of the first
> > paragraph, I think it suffices to say:
> >
> > --
> > A 'character' is an atomic unit of text, as defined in [Unicode] and/or
> > [ISO/IEC 10646] and corresponding to the 'Char' production from [XML].
> > --
> >
> 
> This formulation was taken from XML Schema. Nevertheless, your suggestion is an
> improvement, modulo the fact that, if a character must match the 'Char'
> production, it is not defined as in [Unicode]. Therefore, I've rewritten the
> first two sentences like this:

I'm not sure what you mean by this. Unicode defines a range of code points and 'Char' mirrors it. The definition of 'Char' actually says "Unicode code points" :-).

> 
> A character is an atomic unit of text. Each character has a Universal Character
> Set (UCS) code point [ISO/IEC 10646] (or, equivalently, a Unicode code point
> [UNICODE]), which MUST match the Char production from XML [XML] thus ensuring
> compatibility with XML Schema Datatypes [XML Schema Datatypes].

This looks fine.
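
A minimal sketch, assuming Python, of what "MUST match the Char production" amounts to for a code point. The ranges are those of the XML 1.0 'Char' production as quoted later in this message; the names `XML_CHAR_RANGES`, `is_xml_char`, and `all_xml_chars` are illustrative only and do not come from the draft.

```python
# Inclusive code point ranges of the XML 1.0 'Char' production.
XML_CHAR_RANGES = (
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml_char(code_point: int) -> bool:
    """Return True if the code point falls in one of the Char ranges."""
    return any(lo <= code_point <= hi for lo, hi in XML_CHAR_RANGES)


def all_xml_chars(text: str) -> bool:
    """Return True if every character of the string matches Char."""
    return all(is_xml_char(ord(ch)) for ch in text)
```

Under this check, surrogates (U+D800 to U+DFFF), most C0 controls, and U+FFFE/U+FFFF are rejected, while every other Unicode code point is accepted.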

> 
> > 4. The sentence "Code points are written as U+ followed by the hexadecimal
> > value of the code point" is not quite right. You might moderate this by saying
> > "are represented by U+ (etc.) in this document". Although you barely use the
> > U+ syntax in the document. Note that the sentence is also incomplete: the
> > usual minimum length of a U+ hex sequence is four hex digits (U+00E9).
> >
> 
> I've rephrased the sentence like this:
> 
> Code points are represented in this document as U+ followed by a
> four-digit hexadecimal value of the code point.

That sounds good, although I'd even tend to say "are sometimes represented", since there are plenty of code points that are represented as ASCII characters :-).
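
A small sketch, assuming Python, of the usual U+ convention mentioned above: uppercase hexadecimal, zero-padded to at least four digits, with longer values for code points beyond U+FFFF. The function name `u_plus` is illustrative only.

```python
def u_plus(code_point: int) -> str:
    """Format a code point in U+ notation with at least four hex digits."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # 04X pads to four digits but does not truncate longer values.
    return f"U+{code_point:04X}"


print(u_plus(ord("é")))   # U+00E9
print(u_plus(0x10FFFF))   # U+10FFFF
```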

> 
> > 5. Quoting the number of available code points in Unicode seems slightly
> > absurd to me. I see the assertion about cardinality, but I'm not sure what use
> > it has.
> >
> 
> This is critical for OWL, where you can express existential restrictions over
> the set of the available characters. I strongly believe this example should
> stay in the document.

I won't debate OWL's needs here, except to point out that many (in fact, the majority) of Unicode code points are not even assigned.

> 
> Char ::=
>  #x9                |   1
>  #xA                |   1
>  #xD                |   1
>  [#x20-#xD7FF]      |   55295-32+1=55264
>  [#xE000-#xFFFD]    |   65533-57344+1=8190
>  [#x10000-#x10FFFF]     1114111-65536+1=1048576
> 
> This gives us 1048576+8190+55264+1+1+1=1112033 code points. I've
> updated the spec to this number.

They fail to subtract the other non-characters at the end of each plane... an oversight on their part, not yours.
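
The arithmetic can be checked with a short sketch, assuming Python. It recomputes the size of the 'Char' production from the ranges quoted above, then counts how many Unicode noncharacters (U+FDD0 to U+FDEF, plus the last two code points of each plane) still fall inside those ranges. The names `CHAR_RANGES` and `is_noncharacter` are illustrative, and whether the spec should subtract the noncharacters is a separate question that this sketch does not settle.

```python
# Inclusive ranges of the XML 1.0 'Char' production, as quoted above.
CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


char_total = sum(hi - lo + 1 for lo, hi in CHAR_RANGES)
nonchars_in_char = sum(
    1
    for lo, hi in CHAR_RANGES
    for cp in range(lo, hi + 1)
    if is_noncharacter(cp)
)

print(char_total)                     # 1112033
print(nonchars_in_char)               # 64
print(char_total - nonchars_in_char)  # 1111969
```

Of the 64 noncharacters counted here, 32 are U+FDD0 to U+FDEF and 32 are the final two code points of planes 1 through 16; U+FFFE and U+FFFF are already excluded by the [#xE000-#xFFFD] range.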


> 
> Nevertheless, I don't understand now whether foo-bar is a valid language tag. It
> does seem to match the production from BCP 47, so I'd say it is. Your
> explanation, however, suggests that the "en" part must be registered; is this
> really the case? In any case, I strongly believe that the definitions *must not*
> depend on any kind of a registry, as this would make the consequences of an OWL
> 2 ontology possibly vary over time.

This is why I referred specifically to the conformance requirements in BCP 47, which defines two separate terms:

- "well-formed" means matching the ABNF/grammar but not necessarily checking to see if the subtags are registered. This is the sort of conformance you have.
- "valid" means "well-formed" plus checking that the subtags are each properly registered (and a few other very minor checks on stuff like extensions). This is not the sort of conformance you require, although you allow it.
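
The difference between the two levels can be sketched as follows, assuming Python. The regular expression is a simplified approximation of the BCP 47 'langtag' grammar (it ignores grandfathered tags and tags consisting only of private-use subtags), and `TOY_REGISTRY` is a hypothetical stand-in for the IANA Language Subtag Registry, not real registry data. A real validity check would also look at each subtag's type and position, which this sketch does not do.

```python
import re

# Simplified approximation of the BCP 47 langtag grammar (assumption:
# grandfathered and private-use-only tags are out of scope here).
_LANGTAG = re.compile(
    r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language (+ extlang)
    (?:-[a-z]{4})?                                        # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                           # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*              # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                  # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                            # private use
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Hypothetical stand-in for the subtag registry, for illustration only.
TOY_REGISTRY = frozenset({"en", "fr", "us", "latn"})


def is_well_formed(tag: str) -> bool:
    """Grammar check only: no registry lookup, so the result is stable."""
    return _LANGTAG.fullmatch(tag) is not None


def is_valid(tag: str, registry: frozenset = TOY_REGISTRY) -> bool:
    """Well-formed, and every subtag appears in the given registry."""
    if not is_well_formed(tag):
        return False
    return all(subtag.lower() in registry for subtag in tag.split("-"))


print(is_well_formed("foo-bar"))  # True
print(is_valid("foo-bar"))        # False (with the toy registry above)
print(is_valid("en-US"))          # True (with the toy registry above)
```

Because `is_well_formed` depends only on the grammar, its answer for a given tag does not change as the registry grows, which is the property the definitions need.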

Hope this helps.

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.
Received on Tuesday, 7 April 2009 20:39:27 UTC