- From: Phillips, Addison <addison@amazon.com>
- Date: Tue, 20 Oct 2009 07:50:38 -0700
- To: Michael Kay <mike@saxonica.com>, "www-international@w3.org" <www-international@w3.org>
Hello Michael,

Thank you for the comments. The charreq document is mainly of historical interest, as you note. I (or others on the WG) may have some response to the individual comments, but my first response would be: read all three parts of CharMod instead. These documents are the response to these requirements and are far more valuable and useful than this document.

http://www.w3.org/TR/charmod
http://www.w3.org/TR/charmod-norm/
http://www.w3.org/TR/charmod-resid/

Best Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.

> -----Original Message-----
> From: www-international-request@w3.org [mailto:www-international-request@w3.org] On Behalf Of Michael Kay
> Sent: Tuesday, October 20, 2009 2:44 AM
> To: www-international@w3.org
> Subject: Character model for the web: string identity and string indexing
>
> I realise that the document http://www.w3.org/TR/2009/NOTE-charreq-20090915/ is largely historical, however I took the opportunity to read through it to see whether it formed useful input for development of the QT specifications, and with the endorsement of the joint XSL and XQuery working groups I am submitting the following comments for the record.
>
> Michael Kay
> Saxonica
>
> > -----Original Message-----
> > From: w3c-xsl-query-request@w3.org [mailto:w3c-xsl-query-request@w3.org] On Behalf Of Michael Kay
> > Sent: 06 October 2009 23:57
> > To: w3c-xsl-query@w3.org
> > Subject: Character model for the web: string identity
> >
> > Action A-412-02 Mike Kay to review the Requirements for String Identity Matching and String Indexing note to see if it has any impact on FO.
> >
> > http://www.w3.org/TR/2009/NOTE-charreq-20090915/
> >
> > First, comments on the document.
> >
> > 0. Section 1.2 lists a number of potential users of the operations defined in this document. However, it fails to say what the essential nature of this operation is that makes it applicable to these use cases. The document is defining a boolean function (which it calls "identity") between two strings, but fails to make it clear when this particular function is appropriate, rather than a test that makes finer distinctions or broader distinctions between strings.
> >
> > 1. Section 1.4 talks of the scope affecting aspects of the model that are "time-critical". It's not clear what this means. Does it mean operations on strings that need to be performed fast? Or aspects of the specification that need to be agreed quickly?
> >
> > 2. Section 2. I think it's unfortunate that the document speaks of string identity rather than equality or equivalence. In many computing contexts, two objects can be distinct (not identical) but yet equal. This is also true in some ontological models, and indeed in normal English usage: if I can count how many times the string "hello" appears on a page, this implies that those occurrences of the string "hello" are distinguishable and therefore have separate identity. Indeed, one can argue that it's nonsense to talk of two strings being identical: if they are identical, then there is only one string, not two.
> >
> > 3. In the heading of section 2.3, the choice of the word "invisible" is unfortunate, because it suggests that equivalence might be based on the visual appearance of glyphs. For example, it is hard to argue that the equivalence of the two encodings of ü is justified by the absence of a visual distinction, when the same argument is not being made for equivalence of the Latin, Greek, and Cyrillic letters that look like A.
> >
> > 4. Section 2.4 ("The string identity matching specification shall not treat as equivalent characters that can usually be distinguished by the user") could be used to argue that italic "A" should not be taken as equivalent to underlined "A". Let's face it: the industry has decided to treat some decorations of characters as part of the character code, and other decorations as styling information. There are no strong reasons to overturn those decisions, but we should remember that in many cases they are highly arbitrary. This is of course particularly true of some of the sillier Unicode characters such as circled or superscript digits. (Perhaps it would be useful to rule such debate out of order earlier in the document by defining "string" as a sequence of Unicode codepoints.)
> >
> > 5. Section 2.7. What do you mean by "opaque"? This section is very tricky. Are you suggesting that it should be possible to compare two IRIs by their visual appearance alone? That would mean that Greek A and Latin A are to be treated as identical. If that's not what's intended, then what is? How do I distinguish Greek A and Latin A if the encoding is opaque?
> >
> > 6. Section 2.9 "The string identify specification shall be prepared quickly". I guess the spelling error is there to prove that this requirement has been met. (Or perhaps to prove that humans are capable of detecting string identity where computers cannot.)
> >
> > 7. Section 2.10. List items 2 and 3 of this section start a new topic: we are no longer discussing the specification of whether strings are identical, we are discussing the engineering of systems and protocols to implement that specification. It would be better to align this change of topic with the section heading for section 3 of the document.
> >
> > 8. Section 3.1 states "early normalization has to be uniform, i.e. all components of the WWW that normalize have to do so in one specific way". The inference is incorrect: normalization only needs to be uniform for each interface or protocol. There is no intrinsic reason, for example, why the rules for email have to be the same as the rules for HTTP, or why the rules for HTML have to be the same as the rules for XML. Uniformity across interfaces/protocols may be desirable, but it is not essential. Experience suggests that solving the problem one protocol at a time may be easier than trying to impose a uniform solution on everyone.
> >
> > 9. Section 3.2 "Ideally, early uniform normalization will spread out from the WWW to other parts of the information infrastructure." Sadly, I think this is unlikely. The dual coding of accented characters in Unicode goes back a long way and stems from strongly held views as to which form is preferable; choosing one form over the other in a W3C architecture document is not going to make the quarrel disappear.
> >
> > 10. Section 3.3 "A wide range of text on the WWW will have to be normalized.". At this point, I have to say I think the document is disappearing into cloud cuckoo land. It would be better to state up-front: "The web is vast, much of the content it contains is never going to change, and many of the creators of content on the web are going to ignore any rules we write down. Any proposed solution has to take these facts into account."
> >
> > Second, impact on F+O (and on the semantics of operations in XQuery, XSLT, and XPath that are based on the operators defined in F+O).
> >
> > A. XML allows both composed and decomposed versions of characters. This isn't going to change - we can't make existing XML documents invalid. So in QT specifications, we have to assume both forms can exist. Talk of early normalization is therefore irrelevant. We could in principle require that in the XDM model, all strings are uniformly normalized. However, the run-time costs would probably be unacceptable to users: and see also (C) below.
> >
> > B. We could certainly define an equality comparison between strings that normalizes both strings before comparison. For example, we could introduce a normalizing collation, with a standard URI, and we could mandate that (from some version of our specs) all processors must support this collation. We could also allow or require it to be the default collation.
> >
> > C. XSD 1.1 continues to treat the decomposed and composed forms of a string as not equal and not identical. XSD is not going to change in a hurry: it seems unlikely that there will be a version beyond XSD 1.1. The QT specs need to remain aligned with XSD. Performing implicit conversion from decomposed to composed form or vice versa could make data values invalid against the schema. In practical terms it's therefore a non-starter.
> >
> > D. The document also discusses indexing into character strings. So long as strings can exist in both composed and decomposed forms, it's hard to see how we can change our existing substring() function which performs such indexing. We could introduce a new function, but it would simply be the functional composition of two existing functions, normalize-unicode() and substring(), so there's little added value.
> >
> > Regards,
> >
> > Michael Kay
> > http://www.saxonica.com/
> > http://twitter.com/michaelhkay
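For readers following points 3, 5, B and D of the quoted message, the behaviour under discussion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not anything proposed in the thread: the helper names nfc_equal and nfc_substring are hypothetical, NFC is chosen arbitrarily as the normalization form, and the substring helper is a simplified stand-in for the XPath substring() function (1-based start, no handling of out-of-range arguments).

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Hypothetical normalizing comparison: normalize both strings to NFC,
    then compare them code point by code point."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def nfc_substring(s: str, start: int, length: int) -> str:
    """Hypothetical composition of normalization and indexing.
    Simplified: assumes start >= 1 and counts code points, 1-based."""
    normalized = unicodedata.normalize("NFC", s)
    return normalized[start - 1 : start - 1 + length]


composed = "f\u00fcr"      # u-diaeresis as the single code point U+00FC
decomposed = "fu\u0308r"   # 'u' followed by U+0308 COMBINING DIAERESIS

# Plain code point comparison distinguishes the two encodings.
print(composed == decomposed)

# The normalizing comparison treats them as equal.
print(nfc_equal(composed, decomposed))

# Indexing depends on the form: the two strings differ in length.
print(len(composed), len(decomposed))

# Normalizing before indexing yields the composed character at position 2.
print(nfc_substring(decomposed, 2, 1) == "\u00fc")

# Normalization does not merge look-alikes: Latin A vs Greek capital alpha.
print(nfc_equal("A", "\u0391"))
```

The sketch normalizes on every call, which is the run-time cost point A worries about; an implementation that cared about performance would presumably normalize once and reuse the result.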
Received on Tuesday, 20 October 2009 14:51:17 UTC