RE: comments on rdf:text draft from Boris Motik on 2009-04-07 (public-rdf-text@w3.org from April to June 2009)

From: Boris Motik <boris.motik@comlab.ox.ac.uk>
Date: Tue, 7 Apr 2009 20:31:30 +0100
To: "'Phillips, Addison'" <addison@amazon.com>, <public-rdf-text@w3.org>
Cc: <public-i18n-core@w3.org>
Message-ID: <C4A0C4722A0048E3A0C689A201A4D0AC@wolf>

Hello,

> -----Original Message-----
> From: public-rdf-text-request@w3.org [mailto:public-rdf-text-request@w3.org]
> On Behalf Of Phillips, Addison
> Sent: 07 April 2009 15:24
> To: public-rdf-text@w3.org
> Cc: public-i18n-core@w3.org
> Subject: comments on rdf:text draft
> 
> Hi,
> 
> My comments follow. These are personal comments.
> 
> 1. The start of the intro is kind of unclear. It isn't really
> "internationalized text", but rather natural language (human language) text
> that we're dealing with. Where it says:
> 
> --
> Internationalized text - that is, text that additionally conveys information
> in terms of a language tag - is used in several existing W3C specifications,
> such as RDF, XML, OWL, and RIF.
> --
> 
> I would suggest instead something like:
> 
> --
> The storage and transmission of human (natural) language text is one important
> use of character data in most document formats. The natural language of the
> text needs to be identified for proper display, selection, or processing and
> this identification is usually conveyed via a "language tag".
> --
> 

Thanks for this comment. I'm afraid, however, that I rewrote this part of the
introduction earlier today in response to Sandro's comments, adopting the
"elevator pitch" that Sandro suggested. Please let me know if you think the
current intro needs further revision.

> 2. The second sentence uses the word "language" in a confusing way, since it
> is used for both "natural language" and for "data format language" (i.e. RIF,
> RDF, OWL, etc.). The use of the term language should be kept unambiguous.
> 

Ditto.

> 3. The intro to section 2 is still not quite right. Instead of the first
> paragraph, I think it suffices to say:
> 
> --
> A 'character' is an atomic unit of text, as defined in [Unicode] and/or
> [ISO/IEC 10646] and corresponding to the 'Char' production from [XML].
> --
> 

This formulation was taken from XML Schema. Nevertheless, your suggestion is an
improvement, modulo the fact that, if a character must match the 'Char'
production, it is not defined as in [Unicode]. Therefore, I've rewritten the
first two sentences like this:

A character is an atomic unit of text. Each character has a Universal Character
Set (UCS) code point [ISO/IEC 10646] (or, equivalently, a Unicode code point
[UNICODE]), which MUST match the Char production from XML [XML] thus ensuring
compatibility with XML Schema Datatypes [XML Schema Datatypes].

> 4. The sentence "Code points are written as U+ followed by the hexadecimal
> value of the code point" is not quite right. You might moderate this by saying
> "are represented by U+ (etc.) in this document". Although you barely use the
> U+ syntax in the document. Note that the sentence is also incomplete: the
> usual minimum length of a U+ hex sequence is four hex digits (U+00E9).
> 

I've rephrased the sentence like this:

Code points are represented in this document as U+ followed by a four-digit
hexadecimal value of the code point.
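
As a rough illustration of the notation, here is a minimal Python sketch. It
assumes the usual convention that four hex digits is a minimum, so code points
above U+FFFF simply use more digits; the function name format_code_point is
hypothetical and not taken from the spec.

```python
def format_code_point(cp: int) -> str:
    """Render a code point in U+ notation, zero-padded to at least four hex digits."""
    # Assumption: the valid code point range is 0..0x10FFFF.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp!r}")
    return f"U+{cp:04X}"


print(format_code_point(ord("\u00e9")))  # U+00E9
print(format_code_point(0x10FFFF))       # U+10FFFF
```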

> 5. Quoting the number of available code points in Unicode seems slightly
> absurd to me. I see the assertion about cardinality, but I'm not sure what
> use it has.
> 

This is critical for OWL, where you can express existential restrictions over
the set of the available characters. I strongly believe this example should stay
in the document.

> I think your count is also wrong: every 'FFFF' and 'FFFE' code point is not a
> valid code point (so 1FFFE, 2FFFF, 3FFFE, and AFFFF are all invalid too). The
> actual count is 1112029 (or 0x10F7DD).
> 

How did you compute this? My number comes from what is defined by the 'Char'
production in XML. There, we have this:

Char ::=
 #x9                |   1
 #xA                |   1
 #xD                |   1
 [#x20-#xD7FF]      |   55295-32+1=55264
 [#xE000-#xFFFD]    |   65533-57344+1=8190
 [#x10000-#x10FFFF]     1114111-65536+1=1048576

This gives us 1048576+8190+55264+1+1+1=1112033 code points. I've updated the
spec to this number.
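
The arithmetic above can be re-checked mechanically. This is only a sketch:
the range list is copied from the table above, and the names CHAR_RANGES and
count_code_points are made up for illustration.

```python
# Inclusive ranges of the Char production, as listed in the table above.
CHAR_RANGES = [
    (0x9, 0x9),
    (0xA, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def count_code_points(ranges):
    """Sum the sizes of inclusive (low, high) code point ranges."""
    return sum(high - low + 1 for low, high in ranges)


total = count_code_points(CHAR_RANGES)
print(total, hex(total))  # 1112033 0x10f7e1
```

The count differs by 4 from the 1112029 quoted above. It only reflects what
the Char production admits, which need not match other ways of counting valid
code points.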

> Note that using the hex here is probably a good idea. There are 0x10FFFF code
> points in Unicode, of which a number are permanently reserved (invalid as
> Char).
> 
> 6. You refer to language tags being registered in this text:
> 
> --
> Furthermore, note that this definition does not assume a language tag to be
> registered with the IANA Language Subtag Registry as per BCP 47 [BCP 47]. An
> rdf:text implementation MAY choose to reject unregistered language tags.
> --
> 
> ... but in fact it is *subtags* that may (or may not) be registered. You would
> probably be better off referring to BCP 47's conformance criteria by name
> here. Perhaps:
> 
> --
> Furthermore, note that this definition corresponds to the 'well-formed' rather
> than the 'valid' class of conformance in [BCP 47]. A language tag MAY contain
> subtags that are not registered in the IANA Language Subtag Registry, although
> an rdf:text implementation MAY also choose to reject such invalid language
> tags.
> --
> 

Thanks -- this indeed sounds much better.

> 7. Please lose the 'foo-bar' example. It is invalid in more than one way.
> There are many ways to express the same idea using real subtags (tlh-AQ
> Klingon-as-used-in-Antarctica) or an unregistered subtag (en-fubar).
> 

OK, I've replaced foo-bar with en-fubar.

Nevertheless, I don't understand now whether foo-bar is a valid language tag. It
does seem to match the production from BCP 47, so I'd say it is. Your
explanation, however, suggests that the "en" part must be registered; is this
really the case? In any case, I strongly believe that the definitions *must not*
depend on any kind of a registry, as this would make the consequences of an OWL
2 ontology possibly vary over time.
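
To make the distinction concrete, here is a sketch of a purely syntactic check
that never consults a registry. It deliberately uses a simplified, RFC
3066-style shape (an alphabetic primary subtag followed by alphanumeric
subtags, each 1 to 8 characters), which is looser than the full BCP 47
production; the name is_syntactic_langtag is hypothetical.

```python
import re

# Simplified shape only (an assumption, not the full BCP 47 grammar):
# primary subtag of 1-8 letters, then zero or more "-"-separated
# subtags of 1-8 letters or digits.
_LANGTAG_SHAPE = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")


def is_syntactic_langtag(tag: str) -> bool:
    """Return True if tag has the simplified shape; no registry lookup is done."""
    return _LANGTAG_SHAPE.fullmatch(tag) is not None


for tag in ("foo-bar", "en-fubar", "tlh-AQ", "en--US", ""):
    print(repr(tag), is_syntactic_langtag(tag))
```

Under this check both foo-bar and en-fubar pass, and the answer cannot change
when the registry changes.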

> 8. It is ironic that "lc-langTag" itself is not lowercase.
> 

I've changed this to "lc-langtag".

Thanks a lot for your comments -- I really appreciate them!

Regards,

	Boris

> I may have some more comments later today.
> 
> Addison
> 
> --
> 
> Addison Phillips
> Globalization Architect -- Lab126
> 
> Internationalization is not a feature.
> It is an architecture.
>
Received on Tuesday, 7 April 2009 19:32:44 UTC