Character model for the web: string identity and string indexing from Michael Kay on 2009-10-20 (www-international@w3.org from October to December 2009)

From: Michael Kay <mike@saxonica.com>
Date: Tue, 20 Oct 2009 10:44:19 +0100
To: <www-international@w3.org>
Message-ID: <1BA313AAFFD64BB8A5B21B370DA58A76@Sealion>
 
I realise that the document http://www.w3.org/TR/2009/NOTE-charreq-20090915/
is largely historical; however, I took the opportunity to read through it to
see whether it formed useful input for development of the QT specifications,
and with the endorsement of the joint XSL and XQuery working groups I am
submitting the following comments for the record.

Michael Kay
Saxonica

> -----Original Message-----
> From: w3c-xsl-query-request@w3.org 
> [mailto:w3c-xsl-query-request@w3.org] On Behalf Of Michael Kay
> Sent: 06 October 2009 23:57
> To: w3c-xsl-query@w3.org
> Subject: Character model for the web: string identity
> 
> Action A-412-02 Mike Kay to review the Requirements for 
> String Identity Matching and String Indexing note to see if 
> it has any impact on FO.
> 
> http://www.w3.org/TR/2009/NOTE-charreq-20090915/
> 
> 
> First, comments on the document. 
> 
> 0. Section 1.2 lists a number of potential users of the 
> operations defined in this document. However, it fails to say 
> what the essential nature of this operation is that makes it 
> applicable to these use cases. The document is defining a 
> boolean function (which it calls "identity") between two 
> strings, but fails to make it clear when this particular 
> function is appropriate, rather than a test that makes finer 
> distinctions or broader distinctions between strings. 
> 
> 1. Section 1.4 talks of the scope affecting aspects of the 
> model that are "time-critical". It's not clear what this 
> means. Does it mean operations on strings that need to be 
> performed fast? Or aspects of the specification that need to 
> be agreed quickly?
> 
> 2. Section 2. I think it's unfortunate that the document 
> speaks of string identity rather than equality or 
> equivalence. In many computing contexts, two objects can be 
> distinct (not identical) but yet equal. This is also true in 
> some ontological models, and indeed in normal English usage: 
> if I can count how many times the string "hello" appears on a 
> page, this implies that those occurrences of the string 
> "hello" are distinguishable and therefore have separate 
> identity. Indeed, one can argue that it's nonsense to talk of 
> two strings being identical: if they are identical, then 
> there is only one string, not two.
> 
> 3. In the heading of section 2.3, the choice of the word 
> "invisible" is unfortunate, because it suggests that 
> equivalence might be based on the visual appearance of 
> glyphs. For example, it is hard to argue that the equivalence 
> of the two encodings of ü is justified by the absence of a 
> visual distinction, when the same argument is not being made 
> for equivalence of the Latin, Greek, and Cyrillic letters 
> that look like A.
> 
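To make point 3 concrete, here is a minimal Python sketch of the asymmetry. It assumes Python's standard unicodedata module as the normalizer and uses NFC purely for illustration; the helper name nfc_equal is hypothetical and does not come from the note under review.

```python
import unicodedata

# Two encodings of the same letter, u with diaeresis.
composed = "\u00FC"      # single precomposed code point
decomposed = "u\u0308"   # "u" followed by COMBINING DIAERESIS

# Three letters that usually look alike but are separate code points.
latin_a = "\u0041"       # LATIN CAPITAL LETTER A
greek_alpha = "\u0391"   # GREEK CAPITAL LETTER ALPHA
cyrillic_a = "\u0410"    # CYRILLIC CAPITAL LETTER A


def nfc_equal(s: str, t: str) -> bool:
    """Compare two strings after converting both to NFC (illustrative helper)."""
    return unicodedata.normalize("NFC", s) == unicodedata.normalize("NFC", t)


print(composed == decomposed)            # False: the code point sequences differ
print(nfc_equal(composed, decomposed))   # True: canonically equivalent
print(nfc_equal(latin_a, greek_alpha))   # False: lookalikes are not unified
print(nfc_equal(latin_a, cyrillic_a))    # False
```

Normalization merges the two encodings of the accented letter but leaves the three lookalike letters distinct, so visual appearance cannot be the criterion actually in use.
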
> 4. Section 2.4 ("The string identity matching specification 
> shall not treat as equivalent characters that can usually be 
> distinguished by the user") could be used to argue that 
> italic "A" should not be taken as equivalent to underlined 
> "A". Let's face it: the industry has decided to treat some 
> decorations of characters as part of the character code, and 
> other decorations as styling information. There are no strong 
> reasons to overturn those decisions, but we should remember 
> that in many cases they are highly arbitrary. This is of 
> course particularly true of some of the sillier Unicode 
> characters such as circled or superscript digits. (Perhaps it 
> would be useful to rule such debate out of order earlier in 
> the document by defining "string" as a sequence of Unicode 
> codepoints.)
> 
> 5. Section 2.7. What do you mean by "opaque"? This section is 
> very tricky. Are you suggesting that it should be possible to 
> compare two IRIs by their visual appearance alone? That would 
> mean that Greek A and Latin A are to be treated as identical. 
> If that's not what's intended, then what is? How do I 
> distinguish Greek A and Latin A if the encoding is opaque?
> 
> 6. Section 2.9 "The string identify specification shall be 
> prepared quickly". I guess the spelling error is there to 
> prove that this requirement has been met. (Or perhaps to 
> prove that humans are capable of detecting string identity 
> where computers cannot.)
> 
> 7. Section 2.10. List items 2 and 3 of this section start a 
> new topic: we are no longer discussing the specification of 
> whether strings are identical, we are discussing the 
> engineering of systems and protocols to implement that 
> specification. It would be better to align this change of 
> topic with the section heading for section 3 of the document.
> 
> 8. Section 3.1 states "early normalization has to be uniform, 
> i.e. all components of the WWW that normalize have to do so 
> in one specific way". The inference is incorrect: 
> normalization only needs to be uniform for each interface or 
> protocol. There is no intrinsic reason, for example, why the 
> rules for email have to be the same as the rules for HTTP, or 
> why the rules for HTML have to be the same as the rules for 
> XML. Uniformity across interfaces/protocols may be desirable, 
> but it is not essential. Experience suggests that solving the 
> problem one protocol at a time may be easier than trying to 
> impose a uniform solution on everyone.
> 
> 9. Section 3.2 "Ideally, early uniform normalization will 
> spread out from the WWW to other parts of the information 
> infrastructure." Sadly, I think this is unlikely. The dual 
> coding of accented characters in Unicode goes back a long way 
> and stems from strongly held views as to which form is 
> preferable; choosing one form over the other in a W3C 
> architecture document is not going to make the quarrel disappear.
> 
> 10. Section 3.3 "A wide range of text on the WWW will have to 
> be normalized.". At this point, I have to say I think the 
> document is disappearing into cloud cuckoo land. It would be 
> better to state up-front:
> "The web is vast, much of the content it contains is never 
> going to change, and many of the creators of content on the 
> web are going to ignore any rules we write down. Any proposed 
> solution has to take these facts into account."
> 
> Second, impact on F+O (and on the semantics of operations in 
> XQuery, XSLT, and XPath that are based on the operators 
> defined in F+O).
> 
> A. XML allows both composed and decomposed versions of 
> characters. This isn't going to change - we can't make 
> existing XML documents invalid. So in QT specifications, we 
> have to assume both forms can exist. Talk of early 
> normalization is therefore irrelevant. We could in principle 
> require that in the XDM model, all strings are uniformly 
> normalized. However, the run-time costs would probably be 
> unacceptable to users; see also (C) below.
> 
> B. We could certainly define an equality comparison between 
> strings that normalizes both strings before comparison. For 
> example, we could introduce a normalizing collation, with a 
> standard URI, and we could mandate that (from some version of 
> our specs) all processors must support this collation. We 
> could also allow or require it to be the default collation.
> 
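As a rough illustration of the idea in (B), the following Python sketch models a normalizing collation as a key function. The names normalizing_key and normalized_equal are hypothetical, and the choice of NFC is an assumption: the message does not say which normalization form such a collation would use.

```python
import unicodedata


def normalizing_key(s: str) -> str:
    """Hypothetical collation key: the NFC form of the string (assumed form)."""
    return unicodedata.normalize("NFC", s)


def normalized_equal(a: str, b: str) -> bool:
    """Equality under the sketched collation: compare keys, not raw strings."""
    return normalizing_key(a) == normalizing_key(b)


names = ["Mu\u0308ller", "M\u00FCller", "Meyer"]  # decomposed, composed, plain

print(names[0] == names[1])                  # False: raw code points differ
print(normalized_equal(names[0], names[1]))  # True under the collation

# Grouping by key treats both spellings of the name as one value.
distinct = {normalizing_key(n) for n in names}
print(len(distinct))                         # 2
```

The stored strings are left untouched; only the comparison sees normalized forms, which is what separates this approach from normalizing the data model itself.
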
> C. XSD 1.1 continues to treat the decomposed and composed 
> forms of a string as not equal and not identical. XSD is not 
> going to change in a hurry: it seems unlikely that there will 
> be a version beyond XSD 1.1. The QT specs need to remain 
> aligned with XSD. Performing implicit conversion from 
> decomposed to composed form or vice versa could make data 
> values invalid against the schema. In practical terms it's 
> therefore a non-starter.
> 
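A small Python sketch of how the concern in (C) could arise. It assumes a hypothetical schema type with a maxLength facet of 6 and relies on XSD measuring string length in characters, which Python's len() mirrors by counting code points; satisfies_max_length is an illustrative helper, not part of any schema processor.

```python
import unicodedata


def satisfies_max_length(value: str, max_length: int) -> bool:
    """Illustrative stand-in for a maxLength facet check (counts code points)."""
    return len(value) <= max_length


composed = "M\u00FCller"                               # 6 code points
decomposed = unicodedata.normalize("NFD", composed)    # 7 code points

print(len(composed), len(decomposed))                  # 6 7
print(satisfies_max_length(composed, 6))               # True
print(satisfies_max_length(decomposed, 6))             # False
```

A value that is valid in composed form stops being valid once it is silently decomposed. The reverse can happen too: a facet requiring a length of exactly 7 would accept the decomposed form and reject the composed one.
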
> D. The document also discusses indexing into character 
> strings. So long as strings can exist in both composed and 
> decomposed forms, it's hard to see how we can change our 
> existing substring() function which performs such indexing. 
> We could introduce a new function, but it would simply be the 
> functional composition of two existing functions, 
> normalize-unicode() and substring(), so there's little added value.
> 
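The composition described in (D) can be sketched in Python as follows. The substring function here is a simplified, integer-only stand-in for the F+O substring() (1-based start, length in characters), and normalized_substring is a hypothetical name for the composed function. NFC is assumed because it is the default form of normalize-unicode().

```python
import unicodedata


def substring(s: str, start: int, length: int) -> str:
    """Simplified model of F+O substring(): 1-based, integer arguments only."""
    begin = max(start, 1)        # positions before 1 do not exist
    end = start + length         # first position NOT included
    return s[begin - 1:max(end - 1, begin - 1)]


def normalized_substring(s: str, start: int, length: int) -> str:
    """Hypothetical composition: normalize to NFC first, then take the substring."""
    return substring(unicodedata.normalize("NFC", s), start, length)


decomposed = "Mu\u0308ller"      # "u" followed by COMBINING DIAERESIS

print(substring(decomposed, 1, 2) == "Mu")                   # True: the accent is cut off
print(normalized_substring(decomposed, 1, 2) == "M\u00FC")   # True: the accent is kept
```

In XPath terms the second call corresponds to substring(normalize-unicode($s), 1, 2), which users can already write today.
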
> 
> Regards,
> 
> Michael Kay
> http://www.saxonica.com/
> http://twitter.com/michaelhkay 
> 
Received on Tuesday, 20 October 2009 09:44:56 UTC