- From: Phillips, Addison <addison@lab126.com>
- Date: Thu, 14 Mar 2013 16:07:29 +0000
- To: Norbert Lindenberg <w3@norbertlindenberg.com>, www-international <www-international@w3.org>
Hello Norbert,

Thanks for this. Comments follow.

Addison

> -----Original Message-----
> From: Norbert Lindenberg [mailto:w3@norbertlindenberg.com]
> Sent: Thursday, March 14, 2013 8:23 AM
> To: www-international
> Cc: Norbert Lindenberg
> Subject: [charmod-norm] Comments on CharmodNorm Proposal 2013
>
> These comments relate to:
> http://www.w3.org/International/wiki/index.php?title=CharmodNormProposal2013&oldid=3451
>
> 1) This document should make a clear distinction between formal (markup or
> programming) language content and natural language content. The proposed
> rules appear to be targeting formal language elements, where normalization
> may produce unexpected and undesirable matches. For natural language
> content I think normalization is usually expected, especially for search and
> sort operations.

I agree. I started to break these up. The focus of the parent document is on formal languages, hence the tendency here. I have a block of text about implementing natural language search and find operations (non-formal matching) to put into the main document, but haven't codified that into requirements yet.

> 2) The proposed rules allow locale-specific case folding for formal language
> elements. I think the interpretation of formal language content should not
> depend on locales.

Agreed. I removed this from the requirement, adding an explanatory note (which may *still* be too strong). I also added an NLS processing requirement separately as a placeholder.

> 3) A number of terms need definitions, either in-line or by reference:
> "ASCII-only case-sensitive" through "Unicode case-insensitive
> locale-specific case-folding".

Agreed, although that's really the job of Charmod-Norm's body text. These are only the requirements. Admittedly the shorthand makes this document less accessible.

> 4) Instead of "ASCII" or "US-ASCII", use the Unicode block name "Basic Latin".

Meh. "Basic Latin" is less accessible to most users than ASCII is. I edited the occurrences to "Basic Latin (ASCII)".

> 5) The references to security issues need more detail on what kind of issues
> might occur.

See http://inter-locale.com/w3c/charmod-norm-1.1-draft.html

> 6) Remove "byte-by-byte"; "code unit by code unit" is the correct phrase to use.
> Also, the Unicode standard uses "code point", not "codepoint".

"byte-by-byte" was the original text. My text actually says:

  ... if a specific Unicode character encoding is specified, "byte-by-byte" (or rather code unit-by-code unit) comparison of the sequences.

I dithered about actually removing byte-by-byte for accessibility reasons, but have made the change now. Fixed code point.

> 7) The mention of "private agreements" seems to refer to the idea of
> constructing larger systems (e.g., a search engine) so that text is normalized on
> entry into the system, and all system components afterwards can rely on
> normalization. This probably deserves more comprehensive coverage.

Those rules are the original ones. We might just remove them and just refer to Charmod-Norm 1.0?
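
For illustration only, the matching levels discussed in points 1, 2, 3 and 6 can be sketched roughly as below. This is a minimal Python sketch, not text from the proposal: the function names are hypothetical, the choice of UTF-16 as the encoding for code unit comparison and NFC as the normalization form are assumptions, and `str.casefold()` stands in for locale-independent Unicode case folding.

```python
import unicodedata


def code_units_utf16(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def match_code_units(a, b):
    """Code unit-by-code unit comparison; no normalization, no folding."""
    return code_units_utf16(a) == code_units_utf16(b)


def match_normalized(a, b, form="NFC"):
    """Compare after Unicode normalization (form is an assumption here)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def match_ascii_case_insensitive(a, b):
    """Fold only Basic Latin (ASCII) A-Z to a-z before comparing."""
    table = {c: c + 32 for c in range(ord("A"), ord("Z") + 1)}
    return a.translate(table) == b.translate(table)


def match_unicode_case_insensitive(a, b):
    """Locale-independent Unicode case folding before comparing."""
    return a.casefold() == b.casefold()


precomposed = "caf\u00e9"   # e with acute as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

print(match_code_units(precomposed, decomposed))          # False
print(match_normalized(precomposed, decomposed))          # True
print(match_ascii_case_insensitive("DIV", "div"))         # True
print(match_ascii_case_insensitive("\u00c9", "\u00e9"))   # False
print(match_unicode_case_insensitive("\u00c9", "\u00e9")) # True
```

The first two calls show why normalization can produce matches that a formal language might not expect: the two spellings of "café" differ code unit by code unit but are equal after NFC. None of these functions consults a locale, which is in line with point 2.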
Received on Thursday, 14 March 2013 16:08:01 UTC