- From: C. M. Sperberg-McQueen <cmsmcq@blackmesatech.com>
- Date: Wed, 6 May 2009 13:58:18 -0600
- To: "Boris Motik" <boris.motik@comlab.ox.ac.uk>
- Cc: "C. M. Sperberg-McQueen" <cmsmcq@blackmesatech.com>, public-rdf-text@w3.org, public-i18n-core@w3.org
On 6 May 2009, at 10:34, Boris Motik wrote:

> Hello,
>
> I'm sending this e-mail to public-rdf-text@w3.org only because
> I feel that further discussion might be needed to resolve your
> comments appropriately. Once when we reach an agreement, we
> shall send you an official response through
> public-owl-comments@we.org.

Fine with me. Since the i18n working group has expressed an
interest in the topic, I am taking the liberty of adding
public-i18n-core@w3.org to the CC list.

> I'm really sorry about the missing link. The following URL
> summarizes the differences most of which have been made in
> response to your comment:
>
> http://www.w3.org/2007/OWL/wiki/index.php?title=InternationalizedStringSpec&diff=23289&oldid=22506

Thank you! I've inspected all the relevant changes (at least, I
think so) and congratulate you on the tool that presents such
clear change histories.

The record should show that I checked to see whether any uses of
the xs: prefix needed, on a strict reading, to be interpreted as
QNames instead of as CURIEs, and believe that there are none, so
that you are correct to limit your remarks about QNames to the
mentions of functions and operators in section 5.

> If I correctly understood your response, you have only one
> outstanding issue, which is summarized below.

Correct.

>> [snip]

>>> Point (5): Internationalization issues
>>>
>>> We agree that these might be important issues; however, they
>>> clearly exceed the scope of rdf:text. The main goal of this
>>> specification was to provide adequate names for the sets of
>>> plain literals in RDF, and not to solve all
>>> internationalization problems one might have.

>> I regret to say that I do not think this is a satisfactory
>> response to the issue I raised.
>>
>> The fact of the matter is that while rdf:text appears to be
>> aimed at supporting the representation of natural-language
>> utterances, it is able to do so only for some writing systems,
>> and has at best very poor support for others.
>> That is, it has serious internationalization issues and is
>> not really adequate for the general task of representing
>> natural-language utterances.
>>
>> There are three things you could do, or try to do, about this
>> fact.
>>
>> (1) You could fix it.
>> [snip]

> We fully believe this goal to be untenable. The rdf:text
> specification arose out the need to provide a name for the
> plain literals in RDF, which we needed in RIF and RDF. The goal
> of this specification was thus rather light-weight and we would
> really prefer to keep it so.

That's understandable.

>> (2) You could admit the problem, point it out to the reader,
>> and explain (as far as you know how) how best to work around
>> it.
>> [snip]

> We could point out (e.g., in the Introduction), that rdf:text
> is not meant to serve as a panacea for the internationalization
> problems. We, however, simply don't have enough knowledge in
> the area to discuss the pros, cons, and the scope. It would be
> very useful for us if you could point us to the concrete
> shortcomings; if you could do that, we would be happy to
> mention them in the introduction.

I am by no means an expert in internationalization or the writing
systems of the world. So please take advice from more
knowledgeable people (I recommend the participants in the W3C
Internationalization Activity, if they can spare the time) before
trusting all the details of what I say here.

I believe there are two areas in which rdf:text falls short of
adequacy for the general case. In each area, others more expert
than I may be able to explain the problem to you better, or to
persuade me that it's not really a problem. And, of course, there
may be other areas where there are problems.

(1) Ruby.
If on the basis of the title "rdf:text: A Datatype for
Internationalized Text" I decide that I wish, using rdf:text, to
record for comment the utterances shown in figures 3.2, 3.4, and
3.6 of the W3C "Ruby Annotation" recommendation
(http://www.w3.org/TR/2001/REC-ruby-20010531/), how do I do it?
How do I say, in RDF, "Michael Sperberg-McQueen does not read Han
characters and thus cannot understand the utterance [insert
rdf:text literal here]"?

No string of Unicode characters, interpreted solely as
characters, provides an adequate record of these utterances: a
basic property of Unicode character sequences (or other character
sequences -- or of any sequences at all, for that matter) is that
they are unidimensional, while the given text here is not. The
best advice of experts appears to be provided by the Ruby
Annotation rec itself, namely to use XML markup to associate the
base text with its Ruby annotations, segment by segment.

[By "interpreted solely as characters" I mean not interpreted as
markup or in some other way given meaning beyond that documented
in the Universal Character Set.]

(2) Bidirectional text.

The advice given by virtually all experts today is that
bidirectional text should be represented, in data streams or in
files or in the backing store of a display buffer, in 'logical
order': the first character of a word comes first, the last
character comes last. When a text mixes (for example) English and
Hebrew, the visual ordering of characters and tokens on the
display must account for the differences in conventional writing
direction in English and Hebrew.

Numbers written using Arabic numerals complicate the story
because they are written in the same way in Hebrew and Arabic as
they are in English and other Western European languages.
(Because there is no obvious way to decide whether the
lowest-order digit or the highest-order digit is logically first,
there is some room for debate over whether writers of Hebrew and
Arabic write their numbers left to right or writers of the Latin
script write their numbers right to left; I have heard both views
maintained. But it's a purely theoretical question: conventional
practice is to treat the highest-order digit of a numeral as its
first digit, and to store it first in the backing store.) The
result is that even in monolingual Hebrew or Arabic text, the
sequence of characters in a line of the display may differ from
the sequence of characters in the backing store.

The Unicode Consortium has done a great deal of work to define
algorithms for mapping from the logical ordering in the backing
store to the layout of characters in the display, and the
Universal Character Set defined by Unicode and by ISO 10646
includes characters intended to help control that algorithm.

One might assume, then (as I did for a long time) that
bidirectional text can in fact be represented as a sequence of
characters, possibly including bidi control characters. I have
been told by people more knowledgeable than I that this hope is
naive, that the use of bidi control characters is NOT
recommended, and that the use of XML markup is the preferred
solution to the problem. If memory serves, the most forceful of
my informants was Martin Dürst, then at W3C, but I don't believe
he was the only one who told me this.

The document "Unicode in XML and other Markup Languages", which
is both Unicode Technical Report 20 and a W3C Working Group Note,
recommends in section 3.3 "Bidi Embedding Controls (LRE, RLE,
LRO, RLO, PDF), U+202A..U+202E"
(http://www.w3.org/TR/unicode-xml/#Bidi) that the bidi embedding
controls of the UCS *not* be used, but that XML markup be used
instead to deal with cases where the normal bidi algorithm would
otherwise provide the wrong results.

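To make the logical-order point concrete, here is a minimal
Python sketch (an illustration only, not taken from any of the
specs mentioned). It uses the standard library's
unicodedata.bidirectional to list the bidi class of each
character as stored, and a hypothetical helper,
has_embedding_controls, to flag the U+202A..U+202E controls
discussed in UTR 20 section 3.3. The sample string and both
helper names are assumptions made for this example.

```python
import unicodedata

# The five bidi embedding controls named in UTR 20 section 3.3:
# LRE, RLE, PDF, LRO, RLO (U+202A..U+202E).
BIDI_EMBEDDING_CONTROLS = {"\u202A", "\u202B", "\u202C", "\u202D", "\u202E"}


def bidi_classes(text):
    """Return the Unicode bidi class of each character, in storage order."""
    return [unicodedata.bidirectional(ch) for ch in text]


def has_embedding_controls(text):
    """Report whether the string contains any bidi embedding control."""
    return any(ch in BIDI_EMBEDDING_CONTROLS for ch in text)


# Hypothetical sample: the Hebrew letters shin, lamed, vav, final mem,
# then a space and the digits 1 2 3, all in logical (storage) order.
sample = "\u05E9\u05DC\u05D5\u05DD 123"

# Storage order is fixed; a display engine uses these classes to decide
# the visual order (letters right to left, digits left to right).
for ch, cls in zip(sample, bidi_classes(sample)):
    print(f"U+{ord(ch):04X} {cls}")

print(has_embedding_controls(sample))
```

The sketch only inspects the stored sequence; it does not
implement the bidi algorithm itself, and it says nothing about
how markup would replace the embedding controls.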
The important points to note here seem to me to be:

(a) The standard Unicode bidi algorithm may need to be overridden
not only for polyglot or macaronic text (which may be excluded
from the scope of rdf:text in any case) but even for monolingual
text.

(b) The responsible experts both in W3C and in the Unicode
Consortium recommend that markup be used to handle such cases.

Perhaps the right thing to do in both of these situations is to
use an XML literal, rather than an instance of rdf:text, to
represent the text I am interested in working with. The XML shown
in figure 3.5 of Ruby Annotation provides an example of what that
might look like for the Ruby; the XHTML spec's discussion of the
bdo element has examples of what bidi text might look like.

Perhaps other methods of representing such text are preferable to
XML literals, in an RDF or OWL context. I don't know; I don't
understand RDF or OWL well enough to have an informed opinion.
But if you are seeking to define a datatype for internationalized
text, then I believe it's incumbent upon you to have or develop
an informed opinion, and to tell the reader of your spec either
how to use rdf:text for ruby or for bidi text, or when to use
something else, and when to use it.

Since you have several times used the word "panacea", I should
perhaps say explicitly that I am not looking for a panacea, and I
don't think a vague statement that rdf:text is not intended to
provide one will solve the problem. Your responsibility, if you
are specifying a datatype for internationalized text (or for that
matter, if you are specifying anything at all for the *World
Wide* Web Consortium), is either to make it work for all
languages, or if that is not possible, then AT LEAST to describe
where rdf:text succeeds, and where it falls short, as a datatype
for internationalized text, and what to use instead when it is
not suitable.
By analogy, the XSD spec has a number of passages where we point
out that this or that datatype or construct is not suitable for
natural-language text and say what to use instead (in the case of
most natural-language text: mixed content!).

>> [snip]
>> You can resolve my objection on this point by adding a note to
>> the spec (1) pointing out that rdf:text is not suitable for,
>> and not intended for, the representation of natural-language
>> text or utterances, or (less strongly) that rdf:text cannot be
>> used for the adequate representation of natural-language text
>> in writing systems which require bidi markup or ruby markup,
>> (2) explaining what mechanisms should be used instead, when
>> the text to be represented requires markup, and optionally (3)
>> explaining that this state of affairs is forced upon you by
>> the requirement of compatibility with the existing plain
>> literals of RDF. The note does not need to be long, or
>> elaborate. It just needs to point out the problem and suggest
>> ways of dealing with it.

> As I mentioned above, we can try to address your comment by (1)
> and possibly mentioning (3). I don't think we would like to
> talk about "this solution being forced on us": we are not in
> the position where we can afford to be critical of RDF or any
> other technology from the point of internationalization
> requirements; this clearly exceeds our knowledge and scope. We
> are quite skeptical of (2): we do not posses the sufficient
> knowledge to make a useful comment there.

> Please let me know whether such changes would address your
> objections. Also, I would really appreciate it if you could
> point out the issues that rdf:text does not address in a
> suitable manner.

I hope that the discussion above satisfies the request in your
last sentence.

As to the forcing: I agree that you will wish to be careful in
your wording, and in your position I would try to avoid writing
the word "forced" into the spec, if I could.
But outside the bounds of the spec and carefully nuanced prose, I
think either you must agree that you are forced into your current
design for the sake of compatibility with RDF literals, or else
you must lose the rationale for failing to change your design to
make it better support Japanese, Chinese, Hebrew, Arabic,
Persian, and other languages with what you might call 'complex'
writing systems.

To make a concrete proposal for purposes of discussion, I suggest
that you add a fourth bullet item to the list at the beginning of
section 4 of the spec. If I were drafting it, the first draft
might read like this:

  - Like xsd:string and the plain literals (with or without
    language tags) of RDF, typed rdf:text literals are suitable
    primarily for text that can be adequately represented as a
    sequence of UCS characters, without additional information or
    markup. They are not satisfactory for the representation of
    text with Ruby annotation or bidirectional text in which the
    default Unicode bidirectional algorithm fails to produce
    acceptable results. For such material, it is recommended that
    values of the rdf:XMLLiteral datatype be used instead; since
    it allows embedded markup, it can readily be used for such
    values.

Optionally break into two paragraphs before "They are not
satisfactory" and replace "They" with "Typed rdf:text literals".

This draft assumes that the right way to handle the problem is to
use an XML literal; obviously, if you reach a different
conclusion, then that bit needs to change.

I am not insistent about the placement of the additional text,
nor about the details of the wording. (That is, I don't insist on
this specific wording. But the details of the wording finally
chosen do make a difference, so I would like to see what you come
up with before closing the issue for good.) But you do need to
say something.
It's really not tenable for a spec defining an internationalized
text datatype to have nothing to say about the treatment of
textual material that doesn't fit comfortably into sequences of
UCS characters. Surely this problem has come up before: If RDF
can handle them, say how. If RDF cannot handle them, then the
entire Semantic Web Activity has a problem.

I hope this helps clarify my concerns. Thank you for your time.

--
****************************************************************
* C. M. Sperberg-McQueen, Black Mesa Technologies LLC
* http://www.blackmesatech.com
* http://cmsmcq.com/mib
* http://balisage.net
****************************************************************
Received on Wednesday, 6 May 2009 19:58:58 UTC