Re: [LC response] To C. M. Sperberg-McQueen from C. M. Sperberg-McQueen on 2009-05-06 (public-rdf-text@w3.org from April to June 2009)

From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
Date: Wed, 6 May 2009 13:58:18 -0600
To: "Boris Motik" <boris.motik@comlab.ox.ac.uk>
Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-rdf-text@w3.org, public-i18n-core@w3.org
Message-Id: <A1165E0D-2785-4E65-A926-B84F330B2721@blackmesatech.com>

On 6 May 2009, at 10:34 , Boris Motik wrote:

 > Hello,

 > I'm sending this e-mail to public-rdf-text@w3.org only because
 > I feel that further discussion might be needed to resolve your
 > comments appropriately. Once we reach an agreement, we
 > shall send you an official response through
 > public-owl-comments@w3.org.

Fine with me.  Since the i18n working group has expressed an
interest in the topic, I am taking the liberty of adding
public-i18n-core@w3.org to the CC list.

 > I'm really sorry about the missing link. The following URL
 > summarizes the differences most of which have been made in
 > response to your comment:

 > http://www.w3.org/2007/OWL/wiki/index.php?title=InternationalizedStringSpec&diff=23289&oldid=22506

Thank you! I've inspected all the relevant changes (at least, I
think so) and congratulate you on the tool that presents such
clear change histories.

The record should show that I checked to see whether any uses of
the xs: prefix needed, on a strict reading, to be interpreted as
QNames instead of as CURIEs, and believe that there are none, so
that you are correct to limit your remarks about QNames to the
mentions of functions and operators in section 5.

 > If I correctly understood your response, you have only one
 > outstanding issue, which is summarized below.

Correct.

 >> [snip]

 >>> Point (5): Internationalization issues

 >>> We agree that these might be important issues; however, they
 >>> clearly exceed the scope of rdf:text. The main goal of this
 >>> specification was to provide adequate names for the sets of
 >>> plain literals in RDF, and not to solve all
 >>> internationalization problems one might have.

 >> I regret to say that I do not think this is a satisfactory
 >> response to the issue I raised.

 >> The fact of the matter is that while rdf:text appears to be
 >> aimed at supporting the representation of natural-language
 >> utterances, it is able to do so only for some writing systems,
 >> and has at best very poor support for others.  That is, it has
 >> serious internationalization issues and is not really adequate
 >> for the general task of representing natural-language
 >> utterances.

 >> There are three things you could do, or try to do, about this
 >> fact.

 >> (1) You could fix it.

 >> [snip]

 > We fully believe this goal to be untenable. The rdf:text
 > specification arose out of the need to provide a name for the
 > plain literals in RDF, which we needed in RIF and RDF. The goal
 > of this specification was thus rather light-weight and we would
 > really prefer to keep it so.

That's understandable.

 >> (2) You could admit the problem, point it out to the reader,
 >> and explain (as far as you know how) how best to work around
 >> it.

 >> [snip]

 > We could point out (e.g., in the Introduction), that rdf:text
 > is not meant to serve as a panacea for the internationalization
 > problems. We, however, simply don't have enough knowledge in
 > the area to discuss the pros, cons, and the scope. It would be
 > very useful for us if you could point us to the concrete
 > shortcomings; if you could do that, we would be happy to
 > mention them in the introduction.

I am by no means an expert in internationalization or the writing
systems of the world.  So please take advice from more
knowledgeable people (I recommend the participants in the W3C
Internationalization Activity, if they can spare the time) before
trusting all the details of what I say here.

I believe there are two areas in which rdf:text falls short of
adequacy for the general case.  In each area, others more expert
than I may be able to explain the problem to you better, or to
persuade me that it's not really a problem.  And, of course,
there may be other areas where there are problems.

   (1) Ruby.

       If on the basis of the title "rdf:text: A Datatype for
       Internationalized Text" I decide that I wish, using
       rdf:text, to record for comment the utterances shown in
       figures 3.2, 3.4, and 3.6 of the W3C "Ruby Annotation"
       recommendation
       (http://www.w3.org/TR/2001/REC-ruby-20010531/), how do I do
       it?

       How do I say, in RDF, "Michael Sperberg-McQueen does not
       read Han characters and thus cannot understand the
       utterance [insert rdf:text literal here]"?

       No string of Unicode characters, interpreted solely as
       characters, provides an adequate record of these
       utterances: a basic property of Unicode character sequences
       (or other character sequences -- or of any sequences at
       all, for that matter) is that they are unidimensional,
       while the given text here is not.  The best advice of
       experts appears to be provided by the Ruby Annotation rec
       itself, namely to use XML markup to associate the base text
       with its Ruby annotations, segment by segment.

       [By "interpreted solely as characters" I mean not
       interpreted as markup or in some other way given meaning
       beyond that documented in the Universal Character Set.]

   (2) Bidirectional text.

       The advice given by virtually all experts today is that
       bidirectional text should be represented, in data streams
       or in files or in the backing store of a display buffer, in
       'logical order': the first character of a word comes first,
       the last character comes last.  When a text mixes (for
       example) English and Hebrew, the visual ordering of
       characters and tokens on the display must account for the
       differences in conventional writing direction in English
       and Hebrew.  Numbers written using Arabic numerals
       complicate the story because they are written in the same
       way in Hebrew and Arabic as they are in English and other
       Western European languages.  (Because there is no obvious
       way to decide whether the lowest-order digit or the
       highest-order digit is logically first, there is some room
       for debate over whether writers of Hebrew and Arabic write
       their numbers left to right or writers of the Latin
       script write their numbers right to left; I have heard both
       views maintained.  But it's a purely theoretical question:
       conventional practice is to treat the highest-order digit
       of a numeral as its first digit, and to store it first in
       the backing store.)  The result is that even in monolingual
       Hebrew or Arabic text, the sequence of characters in a line
       of the display may differ from the sequence of characters
       in the backing store.

       The Unicode Consortium has done a great deal of work to
       define algorithms for mapping from the logical ordering in
       the backing store to the layout of characters in the
       display, and the Universal Character Set defined by Unicode
       and by ISO 10646 includes characters intended to help
       control that algorithm.

       One might assume, then (as I did for a long time) that
       bidirectional text can in fact be represented as a sequence
       of characters, possibly including bidi control characters.

       I have been told by people more knowledgeable than I that
       this hope is naive, that the use of bidi control characters
       is NOT recommended, and that the use of XML markup is the
       preferred solution to the problem.  If memory serves, the
       most forceful of my informants was Martin Dürst, then at
       W3C, but I don't believe he was the only one who told me
       this.

       The document "Unicode in XML and other Markup Languages",
       which is both Unicode Technical Report 20 and a W3C Working
       Group Note, recommends in section 3.3 "Bidi Embedding
       Controls (LRE, RLE, LRO, RLO, PDF), U+202A..U+202E"
       (http://www.w3.org/TR/unicode-xml/#Bidi) that the bidi
       embedding controls of the UCS *not* be used, but that XML
       markup be used instead to deal with cases where the normal
       bidi algorithm would otherwise provide the wrong results.

       The important points to note here seem to me to be:

       (a) The standard Unicode bidi algorithm may need to be
       overridden not only for polyglot or macaronic text (which
       may be excluded from the scope of rdf:text in any case) but
       even for monolingual text.

       (b) The responsible experts both in W3C and in the Unicode
       Consortium recommend that markup be used to handle
       such cases.
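
As a minimal sketch of the two points above, assuming Python's
standard unicodedata module: the helpers below (find_bidi_controls and
has_mixed_direction are hypothetical names, and the sample string is
made up for illustration) report whether a string stored in logical
order contains the embedding controls U+202A..U+202E, and whether it
mixes strong left-to-right and right-to-left characters.  The sketch
does not implement the Unicode bidirectional algorithm; it only
inspects the stored character sequence.

```python
import unicodedata

# The bidi embedding controls LRE, RLE, PDF, LRO, RLO occupy U+202A..U+202E.
BIDI_EMBEDDING_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)}


def find_bidi_controls(text):
    """Return (index, Unicode name) for each embedding control in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_EMBEDDING_CONTROLS
    ]


def has_mixed_direction(text):
    """True if text holds both strong LTR (L) and strong RTL (R, AL) characters."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return "L" in classes and bool(classes & {"R", "AL"})


# Hypothetical sample in logical (storage) order: a Latin word, a Hebrew
# word (alef, bet, gimel), and a number with its highest-order digit first.
sample = "abc \u05d0\u05d1\u05d2 123"

# The same text wrapped in RLE ... PDF, the practice the note advises against.
wrapped = "\u202b" + sample + "\u202c"

print(has_mixed_direction(sample))
print(find_bidi_controls(wrapped))
```

A check of this kind could flag plain-text values that either rely on
control characters or may need markup to display correctly; deciding
what to do with them is a separate question.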

Perhaps the right thing to do in both of these situations is to
use an XML literal, rather than an instance of rdf:text, to
represent the text I am interested in working with.  The XML
shown in figure 3.5 of Ruby Annotation provides an example of
what that might look like for the Ruby; the XHTML spec's
discussion of the bdo element has examples of what bidi text
might look like.
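
A rough sketch of what producing such XML literals might look like,
assuming Python's standard xml.etree.ElementTree module.  The helper
names (simple_ruby, bdo, ntriples_xml_literal) and the sample strings
are hypothetical; the XHTML namespace declaration and the
canonicalization expected of rdf:XMLLiteral values are deliberately
left out, so this shows the shape of the idea and is not a conforming
serializer.

```python
import xml.etree.ElementTree as ET

XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


def simple_ruby(base, annotation):
    """Build simple ruby markup: base text in <rb>, its annotation in <rt>."""
    ruby = ET.Element("ruby")
    ET.SubElement(ruby, "rb").text = base
    ET.SubElement(ruby, "rt").text = annotation
    return ET.tostring(ruby, encoding="unicode")


def bdo(text, direction):
    """Wrap text in a <bdo> element with an explicit direction (ltr or rtl)."""
    element = ET.Element("bdo", {"dir": direction})
    element.text = text
    return ET.tostring(element, encoding="unicode")


def ntriples_xml_literal(xml_text):
    """Format xml_text as an N-Triples typed literal (assumed serialization)."""
    escaped = (
        xml_text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"%s"^^<%s>' % (escaped, XML_LITERAL)


# Illustrative values only; they are not taken from the figures cited above.
ruby_xml = simple_ruby("WWW", "World Wide Web")
bdo_xml = bdo("abc", "rtl")

print(ntriples_xml_literal(ruby_xml))
print(ntriples_xml_literal(bdo_xml))
```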

Perhaps other methods of representing such text are preferable to
XML literals, in an RDF or OWL context.  I don't know; I don't
understand RDF or OWL well enough to have an informed opinion.

But if you are seeking to define a datatype for internationalized
text, then I believe it's incumbent upon you to have or develop
an informed opinion, and to tell the reader of your spec either
how to use rdf:text for ruby or for bidi text, or when to use
something else, and what to use.

Since you have several times used the word "panacea", I should
perhaps say explicitly that I am not looking for a panacea, and I
don't think a vague statement that rdf:text is not intended to
provide one will solve the problem.  Your responsibility, if you
are specifying a datatype for internationalized text (or for that
matter, if you are specifying anything at all for the *World
Wide* Web Consortium), is either to make it work for all
languages, or if that is not possible, then AT LEAST to describe
where rdf:text succeeds, and where it falls short, as a datatype
for internationalized text, and what to use instead when it is
not suitable.

By analogy, the XSD spec has a number of passages where we point
out that this or that datatype or construct is not suitable for
natural-language text and say what to use instead (in the case of
most natural-language text:  mixed content!).

 >> [snip]

 >> You can resolve my objection on this point by adding a note to
 >> the spec (1) pointing out that rdf:text is not suitable for,
 >> and not intended for, the representation of natural-language
 >> text or utterances, or (less strongly) that rdf:text cannot be
 >> used for the adequate representation of natural-language text
 >> in writing systems which require bidi markup or ruby markup,
 >> (2) explaining what mechanisms should be used instead, when
 >> the text to be represented requires markup, and optionally (3)
 >> explaining that this state of affairs is forced upon you by
 >> the requirement of compatibility with the existing plain
 >> literals of RDF.  The note does not need to be long, or
 >> elaborate.  It just needs to point out the problem and suggest
 >> ways of dealing with it.

 > As I mentioned above, we can try to address your comment by (1)
 > and possibly mentioning (3). I don't think we would like to
 > talk about "this solution being forced on us": we are not in
 > the position where we can afford to be critical of RDF or any
 > other technology from the point of internationalization
 > requirements; this clearly exceeds our knowledge and scope. We
 > are quite skeptical of (2): we do not possess sufficient
 > knowledge to make a useful comment there.

 > Please let me know whether such changes would address your
 > objections. Also, I would really appreciate it if you could
 > point out the issues that rdf:text does not address in a
 > suitable manner.

I hope that the discussion above satisfies the request in your
last sentence.

As to the forcing: I agree that you will wish to be careful in
your wording, and in your position I would try to avoid writing
the word "forced" into the spec, if I could.  But outside the
bounds of the spec and carefully nuanced prose, I think either you
must agree that you are forced into your current design for the
sake of compatibility with RDF literals, or else you must lose
the rationale for failing to change your design to make it better
support Japanese, Chinese, Hebrew, Arabic, Persian, and other
languages with what you might call 'complex' writing systems.

To make a concrete proposal for purposes of discussion, I suggest
that you add a fourth bullet item to the list at the beginning of
section 4 of the spec.  If I were drafting it, the first draft
might read like this:

   - Like xsd:string and the plain literals (with or without
     language tags) of RDF, typed rdf:text literals are suitable
     primarily for text that can be adequately represented as
     a sequence of UCS characters, without additional information
     or markup.  They are not satisfactory for the representation
     of text with Ruby annotation or bidirectional text in
     which the default Unicode bidirectional algorithm fails
     to produce acceptable results. For such material, it is
     recommended that values of the rdf:XMLLiteral datatype
     be used instead; since it allows embedded markup, it can
     readily be used for such values.

Optionally break into two paragraphs before "They are not
satisfactory" and replace "They" with "Typed rdf:text literals".

This draft assumes that the right way to handle the problem is to
use an XML literal; obviously, if you reach a different
conclusion, then that bit needs to change.

I am not insistent about the placement of the additional text,
nor about the details of the wording.  (That is, I don't insist
on this specific wording.  But the details of the wording finally
chosen do make a difference, so I would like to see what you come
up with before closing the issue for good.)

But you do need to say something.  It's really not tenable for a
spec defining an internationalized text datatype to have nothing
to say about the treatment of textual material that doesn't fit
comfortably into sequences of UCS characters.  Surely this
problem has come up before: If RDF can handle such material, say
how.  If RDF cannot handle it, then the entire Semantic Web
Activity has a problem.

I hope this helps clarify my concerns.  Thank you for your time.


-- 
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************
Received on Wednesday, 6 May 2009 19:58:58 UTC