RE: comments on Character Model for the World Wide Web: String Matching and Searching from Phillips, Addison on 2014-06-19 (www-international@w3.org from April to June 2014)

From: Phillips, Addison <addison@lab126.com>
Date: Thu, 19 Jun 2014 16:29:01 +0000
To: Matitiahu Allouche <matitiahu.allouche@gmail.com>, "www-international@w3.org" <www-international@w3.org>
Message-ID: <7C0AF84C6D560544A17DDDEB68A9DFB5246F9D1E@ex10-mbx-36009.ant.amazon.com>
Replies follow. Thank you for these comments.

From: Matitiahu Allouche [mailto:matitiahu.allouche@gmail.com]
Sent: Thursday, June 19, 2014 5:23 AM
To: www-international@w3.org
Subject: comments on Character Model for the World Wide Web: String Matching and Searching

These are my comments on chapters 1 and 2 of the subject document ( http://www.w3.org/International/docs/charmod-norm/ ).


1)      In 1.3, we find "The policies adopted by the IETF for on the use of character sets on the Internet are documented in [[RFC 2277]]." It seems that the word "for" should be removed.

AP> Done.

2)      In 2, we find "such as those define". It probably should be "such as those which define".

AP> Done. The full sentence (for those playing at home) was: “This is particularly important when defining a "formal language", such as those define document formats for the Web.”

3)      In 2, we find "implementations and tools need to consider the difficulties experienced by users who expect that visually and logically equivalent strings that "ought to" match but are considered to be distinct values and provide a means for users to see these differences and/or normalize them as appropriate." The phrase whose subject is "visually and logically equivalent strings" has no verb. It would probably be better to split the long sentence into shorter ones.

AP> … and you didn’t quote the whole sentence either! I rewrote as:

--

It may be difficult for users to assure that a given Wildebeest or set of Wildebeests uses a consistent textual representation. Tools and implementations need to consider the difficulties experienced by users when visually or logically equivalent strings that "ought to" match (in the user's mind) are considered to be distinct values. Providing a means for users to see these differences and/or normalize them as appropriate makes it possible for end users to avoid failures that spring from invisible differences in their source documents. For example, the W3C Validator warns when an HTML document is not fully in Unicode Normalization Form C.

--
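
A minimal sketch of the kind of check described above, in Python using only the standard `unicodedata` module. The helper names `is_nfc` and `code_points` are hypothetical, chosen for illustration; this is not how the W3C Validator is implemented.

```python
import unicodedata

def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text

def code_points(text: str) -> str:
    """Render a string as U+XXXX values so invisible differences show up."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

composed = "caf\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The two strings render alike but are distinct values until normalized.
print(composed == decomposed)
print(is_nfc(composed), is_nfc(decomposed))
print(code_points(composed))
print(code_points(decomposed))
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Printing the code points is one way to give users "a means to see these differences"; normalizing both sides before comparison is one way to make the strings match.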

4) In 2.1, "the hexadecimal entity &20ac;" should be "the hexadecimal entity &#x20ac;".

AP> Fixed.
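
For reference, a small Python sketch of the difference between the two spellings, using the standard library's `html.unescape`. Other parsers may treat the malformed form differently, so the last line only illustrates this one function's behavior.

```python
import html

# A hexadecimal character reference needs the "#x" prefix.
well_formed = html.unescape("&#x20ac;")
# The decimal reference for the same code point (0x20AC == 8364).
decimal = html.unescape("&#8364;")
# Without "#x" this is not a numeric reference, and no entity has this name.
malformed = html.unescape("&20ac;")

print(well_formed, decimal, malformed)
```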

5)      In 2.2, "in different ways that is" should be "in different ways that are".

AP> Good catch, but I fixed by re-writing this wordy sentence as:

--

These illustrate that there are sometimes many different ways that visually or logically identical texts can be encoded in Unicode (and thus document formats or protocols that use Unicode).

--
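
To make the "many different ways" concrete, here is a sketch in Python. The character chosen (e with circumflex and dot below) is my own example, not one taken from the document; the variable names are hypothetical.

```python
import unicodedata

# Five code point sequences that all render as the same letter.
variants = [
    "\u1ec7",              # fully precomposed
    "\u1eb9\u0302",        # e-with-dot-below + combining circumflex
    "\u00ea\u0323",        # e-with-circumflex + combining dot below
    "e\u0323\u0302",       # base letter + dot below + circumflex
    "e\u0302\u0323",       # base letter + circumflex + dot below
]

# As raw strings they are all different values...
distinct_raw = len(set(variants))
# ...but canonical normalization maps every one to the same sequence.
distinct_nfc = len({unicodedata.normalize("NFC", v) for v in variants})
distinct_nfd = len({unicodedata.normalize("NFD", v) for v in variants})

print(distinct_raw, distinct_nfc, distinct_nfd)
```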

6) In 2.2 table of Canonical Equivalence, what happens for Hangul is not clear for someone not familiar with Hangul; what the Singleton line demonstrates is not clear to me.

AP> The table in 2.2 is taken directly from Unicode. It should include the code point values. An explanation of Hangul’s complexity isn’t directly germane to the text, but probably bears mentioning. I have, for now, introduced an ednote here.
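
Pending the ednote, a rough Python sketch of what the Hangul and Singleton rows are about, with code point values shown. The specific characters are my assumption of suitable examples (a precomposed Hangul syllable and OHM SIGN); `points` is a hypothetical helper.

```python
import unicodedata

def points(text: str) -> list[str]:
    """List the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Hangul: one precomposed syllable is canonically equivalent to a
# sequence of conjoining jamo (leading consonant, vowel, trailing consonant).
syllable = "\ud4db"
jamo = unicodedata.normalize("NFD", syllable)
print(points(syllable), "<->", points(jamo))

# Singleton: a character whose canonical decomposition is a single,
# different character. OHM SIGN maps to GREEK CAPITAL LETTER OMEGA in
# both NFC and NFD, so normalization never produces U+2126.
ohm = "\u2126"
print(points(unicodedata.normalize("NFC", ohm)))
print(points(unicodedata.normalize("NFD", ohm)))
```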
7) In 2.2 table of Compatibility Equivalence, the example for Breaking differences shows only a hyphen in the second column. Maybe I am missing proper fonts, but I am surely missing the message.

AP> This is also taken from UAX#15. That’s a pretty weird example, since it appears to be marking the equivalence of U+FE63 and U+FF0D to hyphen-minus? I can’t find specific mention or examples of “breaking differences”. What *IS* real (and is described in TUS Chapter 2) is the compatibility equivalence of U+00A0 (non-breaking space) to the regular U+0020 space character: I think this is what is intended here (since it is a breaking difference). I think I’ll submit a comment on that ☺.
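
One way to inspect the mappings mentioned above is to read the decomposition field from the Unicode Character Database, sketched here in Python. `compat_mapping` is a hypothetical helper; U+2011 is an extra example I added, not one from the table.

```python
import unicodedata

def compat_mapping(ch: str) -> tuple[str, str]:
    """Return (formatting tag, target text) from a character's decomposition.

    The tag is "" when the mapping is canonical or absent.
    """
    fields = unicodedata.decomposition(ch).split()
    tag = fields.pop(0) if fields and fields[0].startswith("<") else ""
    target = "".join(chr(int(cp, 16)) for cp in fields)
    return tag, target

for ch in ("\u00a0", "\u2011", "\ufe63", "\uff0d"):
    tag, target = compat_mapping(ch)
    shown = " ".join(f"U+{ord(c):04X}" for c in target)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {tag} {shown}")
```

Only the first two carry the `<noBreak>` tag; U+FE63 and U+FF0D are tagged `<small>` and `<wide>`, which fits the reading that they are not really "breaking differences".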

8) Ibidem, I think that each example line should show at least 2 symbols which are deemed equivalent. I see only one symbol in the lines for Circled, Squared Characters, Fractions, Others. This is puzzling.

AP> I agree. I added an ednote for now.
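
Until the table is fixed, a quick Python sketch that prints both sides of a few such equivalences. The sample characters are my own picks for the Circled, Squared and Fractions rows and may not be the ones in the UAX #15 table.

```python
import unicodedata

samples = {
    "circled": "\u2460",    # CIRCLED DIGIT ONE
    "squared": "\u338f",    # SQUARE KG
    "fraction": "\u00bd",   # VULGAR FRACTION ONE HALF
}

# NFKC replaces each compatibility character with its plain equivalent.
equivalents = {
    label: unicodedata.normalize("NFKC", ch) for label, ch in samples.items()
}

for label, ch in samples.items():
    target = equivalents[label]
    shown = " ".join(f"U+{ord(c):04X}" for c in target)
    print(f"{label}: U+{ord(ch):04X} {ch} ~ {target} ({shown})")
```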

9) In 2.3, "A different form of text normalization that can applied" => "that can be applied".

AP> Done.

10) In 2.3, we find "Case-insensitive matching is sometimes useful in contexts where case may vary in a way that is not semantically meaningful or in which case distinctions cannot be controlled by the user." I think that it is not "sometimes useful" but "often useful", maybe "most often useful". Most of the time, my searches in Latin text are, and must be, case insensitive.

AP> I agree. I modified slightly and then added more text as:
--
Case-insensitive matching is useful in contexts where case may vary in a way that is not semantically meaningful or in which case distinctions cannot be controlled by the user. This is particularly true when <a href="#searching">searching</a> a document, but also applies when defining rules for matching user- or content-generated values, such as identifiers.
--
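
A minimal sketch of case-insensitive matching for identifiers, in Python. It combines `str.casefold` with NFD normalization, following the general shape of Unicode canonical caseless matching; `caseless_key` and `caseless_match` are hypothetical names, and language-specific rules (for example Turkish dotted and dotless i) are not handled.

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Fold case and normalize so equivalent spellings share one key."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())

def caseless_match(a: str, b: str) -> bool:
    """Compare two strings ignoring case and canonical-equivalence differences."""
    return caseless_key(a) == caseless_key(b)

print(caseless_match("Straße", "STRASSE"))
print(caseless_match("R\u00e9sum\u00e9", "RE\u0301SUME\u0301"))
print(caseless_match("identifier", "identifer"))
```

Case folding is used instead of lowercasing because some characters, such as "ß", only match their uppercase counterparts after folding.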

--
Shalom (Regards),  Mati
Received on Thursday, 19 June 2014 16:29:32 UTC