RE: Character model for the web: string identity and string indexing

From: Phillips, Addison <addison@amazon.com>
Date: Tue, 20 Oct 2009 07:50:38 -0700
To: Michael Kay <mike@saxonica.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <C7A5719F1E562149BA9171F58BEE2CA4129845FA74@EX-IAD6-B.ant.amazon.com>

Hello Michael,

Thank you for the comments.

The charreq document is mainly of historical interest, as you note. I (or others on the WG) may have some response to the individual comments, but my first response would be: read all three parts of CharMod instead. Those documents are the response to the charreq requirements and are far more valuable and useful than the charreq note itself.

   http://www.w3.org/TR/charmod

   http://www.w3.org/TR/charmod-norm/

   http://www.w3.org/TR/charmod-resid/


Best Regards,

Addison

Addison Phillips
Globalization Architect -- Lab126

Internationalization is not a feature.
It is an architecture.


> -----Original Message-----
> From: www-international-request@w3.org
> [mailto:www-international-request@w3.org] On Behalf Of Michael Kay
> Sent: Tuesday, October 20, 2009 2:44 AM
> To: www-international@w3.org
> Subject: Character model for the web: string identity and string indexing
> 
> 
> I realise that the document
> http://www.w3.org/TR/2009/NOTE-charreq-20090915/
> is largely historical, however I took the opportunity to read through
> it to see whether it formed useful input for development of the QT
> specifications, and with the endorsement of the joint XSL and XQuery
> working groups I am submitting the following comments for the record.
> 
> Michael Kay
> Saxonica
> 
> > -----Original Message-----
> > From: w3c-xsl-query-request@w3.org
> > [mailto:w3c-xsl-query-request@w3.org] On Behalf Of Michael Kay
> > Sent: 06 October 2009 23:57
> > To: w3c-xsl-query@w3.org
> > Subject: Character model for the web: string identity
> >
> > Action A-412-02 Mike Kay to review the Requirements for
> > String Identity Matching and String Indexing note to see if
> > it has any impact on FO.
> >
> > http://www.w3.org/TR/2009/NOTE-charreq-20090915/
> >
> >
> > First, comments on the document.
> >
> > 0. Section 1.2 lists a number of potential users of the
> > operations defined in this document. However, it fails to say
> > what the essential nature of this operation is that makes it
> > applicable to these use cases. The document is defining a
> > boolean function (which it calls "identity") between two
> > strings, but fails to make it clear when this particular
> > function is appropriate, rather than a test that makes finer
> > distinctions or broader distinctions between strings.
> >
> > 1. Section 1.4 talks of the scope affecting aspects of the
> > model that are "time-critical". It's not clear what this
> > means. Does it mean operations on strings that need to be
> > performed fast? Or aspects of the specification that need to
> > be agreed quickly?
> >
> > 2. Section 2. I think it's unfortunate that the document
> > speaks of string identity rather than equality or
> > equivalence. In many computing contexts, two objects can be
> > distinct (not identical) but yet equal. This is also true in
> > some ontological models, and indeed in normal English usage:
> > if I can count how many times the string "hello" appears on a
> > page, this implies that those occurrences of the string
> > "hello" are distinguishable and therefore have separate
> > identity. Indeed, one can argue that it's nonsense to talk of
> > two strings being identical: if they are identical, then
> > there is only one string, not two.
> >
> > 3. In the heading of section 2.3, the choice of the word
> > "invisible" is unfortunate, because it suggests that
> > equivalence might be based on the visual appearance of
> > glyphs. For example, it is hard to argue that the equivalence
> > of the two encodings of ü is justified by the absence of a
> > visual distinction, when the same argument is not being made
> > for equivalence of the Latin, Greek, and Cyrillic letters
> > that look like A.
> >
> > 4. Section 2.4 ("The string identity matching specification
> > shall not treat as equivalent characters that can usually be
> > distinguished by the user") could be used to argue that
> > italic "A" should not be taken as equivalent to underlined
> > "A". Let's face it: the industry has decided to treat some
> > decorations of characters as part of the character code, and
> > other decorations as styling information. There are no strong
> > reasons to overturn those decisions, but we should remember
> > that in many cases they are highly arbitrary. This is of
> > course particularly true of some of the sillier Unicode
> > characters such as circled or superscript digits. (Perhaps it
> > would be useful to rule such debate out of order earlier in
> > the document by defining "string" as a sequence of Unicode
> > codepoints.)
> >
> > 5. Section 2.7. What do you mean by "opaque"? This section is
> > very tricky.
> > Are you suggesting that it should be possible to compare two
> > IRIs by their visual appearance alone? That would mean that
> > Greek A and Latin A are to be treated as identical. If that's
> > not what's intended, then what is? How do I distinguish Greek
> > A and Latin A if the encoding is opaque?
> >
> > 6. Section 2.9 "The string identify specification shall be
> > prepared quickly". I guess the spelling error is there to
> > prove that this requirement has been met. (Or perhaps to
> > prove that humans are capable of detecting string identity
> > where computers cannot.)
> >
> > 7. Section 2.10. List items 2 and 3 of this section start a
> > new topic: we are no longer discussing the specification of
> > whether strings are identical, we are discussing the
> > engineering of systems and protocols to implement that
> > specification. It would be better to align this change of
> > topic with the section heading for section 3 of the document.
> >
> > 8. Section 3.1 states "early normalization has to be uniform,
> > i.e. all components of the WWW that normalize have to do so
> > in one specific way". The inference is incorrect:
> > normalization only needs to be uniform for each interface or
> > protocol. There is no intrinsic reason, for example, why the
> > rules for email have to be the same as the rules for HTTP, or
> > why the rules for HTML have to be the same as the rules for
> > XML. Uniformity across interfaces/protocols may be desirable,
> > but it is not essential. Experience suggests that solving the
> > problem one protocol at a time may be easier than trying to
> > impose a uniform solution on everyone.
> >
> > 9. Section 3.2 "Ideally, early uniform normalization will
> > spread out from the WWW to other parts of the information
> > infrastructure." Sadly, I think this is unlikely. The dual
> > coding of accented characters in Unicode goes back a long way
> > and stems from strongly held views as to which form is
> > preferable; choosing one form over the other in a W3C
> > architecture document is not going to make the quarrel disappear.
> >
> > 10. Section 3.3 "A wide range of text on the WWW will have to
> > be normalized.". At this point, I have to say I think the
> > document is disappearing into cloud cuckoo land. It would be
> > better to state up-front:
> > "The web is vast, much of the content it contains is never
> > going to change, and many of the creators of content on the
> > web are going to ignore any rules we write down. Any proposed
> > solution has to take these facts into account."
> >
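
The distinction drawn in comment 3 can be illustrated with a minimal Python sketch. It is only an illustration, not something taken from the charreq note or the QT specifications: the helper name canonically_equal is hypothetical, and NFC is assumed as the normalization form. The two encodings of ü compare equal once both are normalized, while the Latin, Greek, and Cyrillic letters that look like A remain distinct code points under the same test.

```python
import unicodedata

# Two encodings of "ü": precomposed, and "u" followed by a combining diaeresis.
composed = "\u00fc"      # LATIN SMALL LETTER U WITH DIAERESIS
decomposed = "u\u0308"   # LATIN SMALL LETTER U + COMBINING DIAERESIS


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare after normalizing both operands to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Three letters that usually render alike but are separate code points.
latin_a = "\u0041"       # LATIN CAPITAL LETTER A
greek_a = "\u0391"       # GREEK CAPITAL LETTER ALPHA
cyrillic_a = "\u0410"    # CYRILLIC CAPITAL LETTER A

print(composed == decomposed)                   # plain code point comparison
print(canonically_equal(composed, decomposed))  # comparison after NFC
print(canonically_equal(latin_a, greek_a))
print(canonically_equal(latin_a, cyrillic_a))
```

In other words, the equivalence follows from Unicode canonical equivalence rather than from visual appearance, which is the objection to the word "invisible".
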
> > Second, impact on F+O (and on the semantics of operations in
> > XQuery, XSLT, and XPath that are based on the operators
> > defined in F+O).
> >
> > A. XML allows both composed and decomposed versions of
> > characters. This isn't going to change - we can't make
> > existing XML documents invalid. So in QT specifications, we
> > have to assume both forms can exist. Talk of early
> > normalization is therefore irrelevant. We could in principle
> > require that in the XDM model, all strings are uniformly
> > normalized. However, the run-time costs would probably be
> > unacceptable to users: and see also (C) below.
> >
> > B. We could certainly define an equality comparison between
> > strings that normalizes both strings before comparison. For
> > example, we could introduce a normalizing collation, with a
> > standard URI, and we could mandate that (from some version of
> > our specs) all processors must support this collation. We
> > could also allow or require it to be the default collation.
> >
> > C. XSD 1.1 continues to treat the decomposed and composed
> > forms of a string as not equal and not identical. XSD is not
> > going to change in a hurry: it seems unlikely that there will
> > be a version beyond XSD 1.1. The QT specs need to remain
> > aligned with XSD. Performing implicit conversion from
> > decomposed to composed form or vice versa could make data
> > values invalid against the schema. In practical terms it's
> > therefore a non-starter.
> >
> > D. The document also discusses indexing into character
> > strings. So long as strings can exist in both composed and
> > decomposed forms, it's hard to see how we can change our
> > existing substring() function which performs such indexing.
> > We could introduce a new function, but it would simply be the
> > functional composition of two existing functions,
> > normalize-unicode() and substring(), so there's little added
> > value.
> >
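
Points B and D can be sketched in Python, under stated assumptions. The function names below are hypothetical stand-ins, not the F+O functions themselves: xpath_substring only approximates substring() for integer arguments (1-based, counting code points), unicodedata.normalize plays the role of normalize-unicode(), and NFC is an assumed default form.

```python
import unicodedata


def normalized_equal(a: str, b: str, form: str = "NFC") -> bool:
    """Point B: an equality test that normalizes both operands first."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def xpath_substring(s: str, start: int, length: int) -> str:
    """Rough analogue of substring() for integer arguments (1-based positions)."""
    begin = max(start - 1, 0)
    end = max(start - 1 + length, 0)
    return s[begin:end]


def normalized_substring(s: str, start: int, length: int, form: str = "NFC") -> str:
    """Point D: simply the composition of normalization and substring."""
    return xpath_substring(unicodedata.normalize(form, s), start, length)


composed = "M\u00fcller"      # precomposed ü, 6 code points
decomposed = "Mu\u0308ller"   # u + combining diaeresis, 7 code points

print(composed == decomposed)                   # differs by code points
print(normalized_equal(composed, decomposed))   # equal once both are normalized
print(len(composed), len(decomposed))           # indexing sees different lengths
print(xpath_substring(decomposed, 1, 2) == "Mu")
print(normalized_substring(decomposed, 1, 2) == normalized_substring(composed, 1, 2))
```

The sketch shows why indexing results depend on which form a string happens to be in, and why a combined function would add little beyond calling the two existing functions in sequence.
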
> >
> > Regards,
> >
> > Michael Kay
> > http://www.saxonica.com/
> > http://twitter.com/michaelhkay
> >
> >
> 

Received on Tuesday, 20 October 2009 14:51:17 GMT