RE: [charmod-norm] Comments on CharmodNorm Proposal 2013

Hello Norbert,

Thanks for this. Comments follow.

Addison

> -----Original Message-----
> From: Norbert Lindenberg [mailto:w3@norbertlindenberg.com]
> Sent: Thursday, March 14, 2013 8:23 AM
> To: www-international
> Cc: Norbert Lindenberg
> Subject: [charmod-norm] Comments on CharmodNorm Proposal 2013
> 
> These comments relate to:
> http://www.w3.org/International/wiki/index.php?title=CharmodNormProposal2013&oldid=3451
> 
> 1) This document should make a clear distinction between formal (markup or
> programming) language content and natural language content. The proposed
> rules appear to be targeting formal language elements, where normalization
> may produce unexpected and undesirable matches. For natural language
> content I think normalization is usually expected, especially for search and sort
> operations.

I agree. I started to break these up. The focus of the parent document is on formal languages, hence the tendency here. I have a block of text about implementing natural language search and find operations (non-formal matching) to put into the main document, but haven't codified that into requirements yet.
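
For readers following along, here is a minimal Python sketch of the kind of unexpected match at issue. The two strings and the helper names are made up for illustration; they are not taken from the proposal.

```python
import unicodedata

# Two hypothetical identifiers that render identically but differ in code points.
precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT


def matches_raw(a, b):
    """Compare code point by code point, with no normalization."""
    return a == b


def matches_nfc(a, b):
    """Compare after normalizing both strings to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_result = matches_raw(precomposed, decomposed)
nfc_result = matches_nfc(precomposed, decomposed)
```

The two strings are distinct without normalization and equal with it. For a search over natural language text that is probably what the user wants; for element names or identifiers in a formal language it may be a surprise.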

> 
> 2) The proposed rules allow locale-specific case folding for formal language
> elements. I think the interpretation of formal language content should not
> depend on locales.

Agreed. I removed this from the requirement, adding an explanatory note (which may *still* be too strong). I also added an NLS processing requirement separately as a placeholder.
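
As a rough illustration of why locale-dependent folding is a problem for formal language elements, the Python sketch below compares the default Unicode case folding with the Turkic tailoring listed in CaseFolding.txt (status T). The function `fold_turkic` is a hypothetical helper written for this example, not a library API, and it covers only the two dotted/dotless I mappings.

```python
def fold_default(s):
    """Locale-independent Unicode full case folding."""
    return s.casefold()


def fold_turkic(s):
    """Sketch of Turkic-tailored folding (assumption: only the two 'T'
    mappings are applied, then default folding handles the rest)."""
    # I (U+0049) folds to dotless i (U+0131); İ (U+0130) folds to i (U+0069).
    tailored = s.replace("I", "\u0131").replace("\u0130", "i")
    return tailored.casefold()


default_form = fold_default("TITLE")
turkic_form = fold_turkic("TITLE")
same_match = default_form == turkic_form
```

Under the default folding "TITLE" matches "title"; under the Turkic tailoring it does not, so the same markup would be interpreted differently depending on the locale in effect.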

> 
> 3) A number of terms need definitions, either in-line or by reference: "ASCII-
> only case-sensitive" through "Unicode case-insensitive locale-specific case-
> folding".

Agreed, although that's really the job of Charmod-Norm's body text. These are only the requirements. Admittedly the shorthand makes this document less accessible.

> 
> 4) Instead of "ASCII" or "US-ASCII", use the Unicode block name "Basic Latin".

Meh. "Basic Latin" is less accessible to most users than "ASCII" is. I edited the occurrences to read "Basic Latin (ASCII)".

> 
> 5) The references to security issues need more detail on what kind of issues
> might occur.

See http://inter-locale.com/w3c/charmod-norm-1.1-draft.html 
> 
> 6) Remove "byte-by-byte"; "code unit by code unit" is the correct phrase to use.
> Also, the Unicode standard uses "code point", not "codepoint".

"byte-by-byte" was the original text. My text actually says:
 
... if a specific Unicode character encoding is specified, "byte-by-byte" (or rather code unit-by-code unit) comparison of the sequences.

I dithered about actually removing byte-by-byte for accessibility reasons, but have made the change now.

Fixed code point.
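
To show why the distinction matters, here is a small Python sketch assuming UTF-16 as the encoding in question. The helper `utf16_code_units` is a name invented for this example.

```python
import struct


def utf16_code_units(data, byteorder):
    """Unpack raw UTF-16 bytes into a list of 16-bit code units."""
    prefix = "<" if byteorder == "little" else ">"
    return list(struct.unpack(prefix + "H" * (len(data) // 2), data))


# 'a' plus a supplementary character, which needs a surrogate pair in UTF-16.
text = "a\U0001F600"
le_bytes = text.encode("utf-16-le")
be_bytes = text.encode("utf-16-be")

bytes_equal = le_bytes == be_bytes
units_equal = (
    utf16_code_units(le_bytes, "little") == utf16_code_units(be_bytes, "big")
)
```

The two serializations differ byte for byte but are identical code unit for code unit, which is why "code unit by code unit" is the more accurate description of the comparison.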
> 
> 7) The mention of "private agreements" seems to refer to the idea of
> constructing larger systems (e.g., a search engine) so that text is normalized on
> entry into the system, and all system components afterwards can rely on
> normalization. This probably deserves more comprehensive coverage.
> 
Those rules are the original ones. We might remove them and refer to Charmod-Norm 1.0 instead?
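
For what it's worth, the "normalize on entry" arrangement described in the comment could be sketched as below. This is only an illustration under assumptions: `NormalizedStore` and its methods are hypothetical names, and NFC is picked arbitrarily as the agreed form.

```python
import unicodedata


class NormalizedStore:
    """Hypothetical store in which text is normalized once, on entry."""

    def __init__(self):
        self._items = set()

    def ingest(self, text):
        # The only place normalization happens (assumed form: NFC).
        self._items.add(unicodedata.normalize("NFC", text))

    def contains(self, normalized_text):
        # Internal components compare directly and rely on the
        # invariant established by ingest().
        return normalized_text in self._items


store = NormalizedStore()
store.ingest("cafe\u0301")            # decomposed input from outside
found = store.contains("caf\u00e9")   # internal lookup uses the NFC form
```

The agreement is "private" in the sense that `contains` is only correct for callers who already hold normalized text; anything arriving from outside has to go through the boundary first.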

Received on Thursday, 14 March 2013 16:08:01 UTC