Comments on charmod draft 2/20/2002

I reviewed WD-charmod-20020220 and would like to post some comments.

3.5 Reference Processing Model

- "... Unicode code points from U+0 to U+0FFFF inclusive; ..."

  U+0FFFF is a typo for U+10FFFF.

- The first Note in this section says "All specifications that
  derive from the XML 1.0 specification [XML 1.0] automatically
  inherit this Reference Processing Model." But XML 1.0 is not a very
  good example, because it doesn't allow the full range of Unicode
  code points and doesn't justify the exceptions.
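
  To make the gap concrete, here is a rough sketch based on my reading
  of the Char production in XML 1.0. The helper names (is_xml10_char,
  excluded_count) are my own, for illustration only.

```python
# Char production as I read it in XML 1.0:
#   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
#          | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
XML10_CHAR_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF),
]

def is_xml10_char(code_point: int) -> bool:
    """True if the code point matches the Char production above."""
    return any(lo <= code_point <= hi for lo, hi in XML10_CHAR_RANGES)

def excluded_count() -> int:
    """Number of code points in U+0000..U+10FFFF that Char leaves out."""
    total = 0x10FFFF + 1
    allowed = sum(hi - lo + 1 for lo, hi in XML10_CHAR_RANGES)
    return total - allowed

if __name__ == "__main__":
    for cp in (0x0, 0x41, 0xFFFE, 0x10FFFF):
        print(f"U+{cp:04X}", is_xml10_char(cp))
    print("excluded:", excluded_count())
```

  The excluded code points are most of the C0 controls, the surrogate
  range, and U+FFFE/U+FFFF.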

3.6.1 Mandating a unique character encoding

- "There is also no ambiguity if data is transferred
  non-electronically and later has to be converted back to a digital
  representation."

  If "transferred non-electronically" means that characters are
  written on paper, there is a lot of ambiguity in determining
  characters from glyphs, such as whether a given space is SPACE
  U+0020 or NO-BREAK SPACE U+00A0.
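
  A small sketch of the point, using Python's standard unicodedata
  module. The sample strings and the describe helper are made up for
  illustration.

```python
import unicodedata

SPACE = "\u0020"
NO_BREAK_SPACE = "\u00A0"

def describe(ch: str) -> str:
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Two strings that would typically look the same on paper but differ
# as sequences of code points.
printed_a = "10" + SPACE + "km"
printed_b = "10" + NO_BREAK_SPACE + "km"

if __name__ == "__main__":
    print(describe(SPACE))
    print(describe(NO_BREAK_SPACE))
    print(printed_a == printed_b)
```

  Someone re-keying the printed text has no way to tell from the glyph
  alone which of the two characters was intended.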

3.6.2 Character Encoding Identification

- In the fourth Note, there is a typo: "identifers".

- "[S] Specifications MUST NOT use heuristics to determine the
  encoding of data."

  In what situation would a specification "determine" the encoding of
  data?

3.7 Character Escaping

- In the first paragraph, two terms, "character data" and "text data",
  appear, which seem to mean the same thing. It would be better to use
  one of the terms consistently.
  
- "[S] Explicit end delimiters MUST be provided. Escapes such as
  \uABCD where the end delimiter is a space or any character other
  than [01-9A-F] SHOULD be avoided."

  MUST and SHOULD are mixed here. If the first requirement is a MUST,
  the second should also be a MUST.
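
  To illustrate why the implicit form is a problem, here is a minimal
  sketch. Both escape syntaxes below are hypothetical ones I made up
  for the comparison (a variable-length "\u" escape with no end
  delimiter, and a "\u{...}" form with one); neither is taken from the
  draft.

```python
import re

# Hypothetical syntax A: "\u" plus one or more hex digits. The escape
# ends at whatever non-hex character happens to follow.
IMPLICIT = re.compile(r"\\u([0-9A-Fa-f]+)")

# Hypothetical syntax B: "\u{" hex digits "}" with an explicit end
# delimiter.
EXPLICIT = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")

def unescape(pattern: re.Pattern, text: str) -> str:
    """Replace every escape matched by pattern with its character."""
    return pattern.sub(lambda m: chr(int(m.group(1), 16)), text)

if __name__ == "__main__":
    # Intended text: "A" (U+0041) followed by the letters "bc".
    print(repr(unescape(EXPLICIT, r"\u{41}bc")))  # explicit delimiter
    print(repr(unescape(IMPLICIT, r"\u41bc")))    # "bc" read as digits
    print(repr(unescape(IMPLICIT, r"\u41 bc")))   # space stays in text
```

  With syntax A, the letters "bc" are swallowed into the escape, and
  inserting a space to end it changes the text itself.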


4.3 Responsibility for Normalization

- "[S] [I] A text-processing component that receives suspect text MUST
  NOT perform any normalization-sensitive operations unless it has
  first successfully validated the text for normalization, and MUST
  NOT normalize the suspect text."

  I understand that some applications, such as an XML processor, MUST
  NOT normalize the suspect text, because normalization can turn a
  well-formed document into an ill-formed one. On the other hand, some
  applications, such as a search engine, SHOULD normalize text so that
  they can find canonically equivalent text.
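
  Both cases can be sketched with Python's standard unicodedata
  module. The sample strings are my own, and NFC is used here only as
  one possible normalization form.

```python
import unicodedata

def nfc(text: str) -> str:
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

# Search engine case: two canonically equivalent spellings compare
# equal only after both are normalized.
composed = "caf\u00E9"      # precomposed e with acute
decomposed = "cafe\u0301"   # e followed by COMBINING ACUTE ACCENT

# XML processor case: element content that starts with U+0338
# COMBINING LONG SOLIDUS OVERLAY. NFC composes ">" + U+0338 into
# U+226F NOT GREATER-THAN, so the start tag loses its closing ">".
markup = "<a>\u0338</a>"
normalized_markup = nfc(markup)

if __name__ == "__main__":
    print(composed == decomposed)
    print(nfc(composed) == nfc(decomposed))
    print(markup.count(">"), normalized_markup.count(">"))
```

  So the same operation that helps the search engine damages the
  markup, which is why I think the requirement should depend on the
  kind of application.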


-------------------
Shigemichi Yazawa
yazawa@globalsight.com

Received on Tuesday, 12 March 2002 16:25:43 UTC