[charmod-norm] Comments on CharmodNorm Proposal 2013

These comments relate to:
http://www.w3.org/International/wiki/index.php?title=CharmodNormProposal2013&oldid=3451

1) This document should make a clear distinction between formal language content (markup or programming languages) and natural language content. The proposed rules appear to target formal language elements, where normalization may produce unexpected and undesirable matches. For natural language content, I think normalization is usually expected, especially for search and sort operations.

2) The proposed rules allow locale-specific case folding for formal language elements. I think the interpretation of formal language content should not depend on locales.

3) A number of terms need definitions, either in-line or by reference: "ASCII-only case-sensitive" through "Unicode case-insensitive locale-specific case-folding".

4) Instead of "ASCII" or "US-ASCII", use the Unicode block name "Basic Latin".

5) The references to security issues need more detail on what kinds of issues might occur.

6) Remove "byte-by-byte"; "code unit by code unit" is the correct phrase to use. Also, the Unicode Standard uses "code point", not "codepoint".

7) The mention of "private agreements" seems to refer to the idea of constructing larger systems (e.g., a search engine) so that text is normalized on entry into the system, and all later components can rely on the text already being normalized. This probably deserves more comprehensive coverage.
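
As a rough illustration of comments 1 and 7, here is a minimal Python sketch of "normalize on entry". It is my own example, not anything taken from the proposal: the function name normalize_on_entry, the choice of NFC, and the toy index are all assumptions made for illustration.

```python
import unicodedata


def normalize_on_entry(text: str, form: str = "NFC") -> str:
    """Normalize text once, at the system boundary (hypothetical helper)."""
    return unicodedata.normalize(form, text)


# Two canonically equivalent spellings of the same word.
composed = "caf\u00e9"      # e with acute as a single code point
decomposed = "cafe\u0301"   # e followed by a combining acute accent

# A plain code point comparison treats them as different strings.
raw_match = composed == decomposed

# Toy index: the key is normalized when the document enters the system ...
index = {normalize_on_entry(decomposed): "doc-1"}

# ... and the query is normalized the same way before lookup, so
# components behind the boundary never see unnormalized text.
lookup = index.get(normalize_on_entry(composed))
```

Without the boundary step the two spellings do not match, which is what a user searching natural language text would not expect. With it, every component behind the boundary can compare strings directly.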

Norbert

Received on Thursday, 14 March 2013 15:23:41 UTC