General

Numbers for each conformance requirements clause would greatly aid referencing them.

Please at least link to an accessible representation of "foreign" characters rather than merely providing raster images of them. The text of this specification does not conform to itself, since it uses bytes (pixels) to represent Unicode characters. It's also less than optimal wrt WAI guidelines. Appendix B is a lot better. But if the concern is to ensure correct rendering on legacy browsers, at least provide a link to the actual Unicode sample, as characters and markup.

Much of this document is a statement of existing good design practice. Many existing W3C specifications implement large parts of it. This is good. Care should be taken with MUSTs which make W3C Recs non-conforming. For example, XML 1.0 and 1.1 are non-conforming.

These sections (collectively "character 101"): 
3 Characters. 
5 Compatibility and Formatting Characters. 
6 String Identity Matching. 
7 String Indexing 
9 Referencing the Unicode Standard and ISO/IEC 10646

taken as a group, are great, in general, and should be collected together with appropriate introductory and reference material as a separate document and moved to Proposed Rec once it exits Last Call. There is already a large body of existing implementation of these concepts in W3C Recs.

Section 4 Early Uniform Normalization is very important, but affects a lot of specifications and needs, I believe, a CR period, as does section 8 Character Encoding in URI References (perhaps; there is some experience for the latter). Thus, I suggest splitting the document so that sections 4 and 8 can move to CR without delaying sections 3, 5, 6, 7 and 9, which are needed as a Rec ASAP!

The portions about collation and sorting (for example 3.1.5 Units of collation) are sparse, vague, and anecdotal, which contrasts strangely with the MUSTs; this material should be removed and returned for further work to produce a separate architectural specification on collation that has crisp, well thought out conformance criteria. The maturity of the collation parts does not match that of the "character 101", normalization and URI reference parts.


3.1.3 Units of visual rendering

"Logical selection looks like this:"

There should be a requirement after that

[S][I] Specifications of protocols and APIs that involve selection of ranges MUST provide for contiguous logical selections.

Having defined the terms "logical selection mode" and "visual selection mode", please use them rather than the highly ambiguous "discontiguous selections" and "contiguous selections", so in fact that should be

[S][I] Specifications of protocols and APIs that involve selection of ranges MUST provide for text selection in logical selection mode.

Also, should there not be something about copying that selection and pasting it somewhere else, that what you get is the logical selection?

Similarly in the next part, I suggest rewording to remove the ambiguous phrase:

[S] Specifications of protocols and APIs that involve selection of ranges SHOULD provide for text selection in logical selection mode, at least to the extent necessary to support implementation of visual selection on screen on top of those protocols and APIs.

It's not clear that this is such a strong requirement, and it complicates processing, especially on handheld devices. Perhaps weaken to MAY? And say what happens when this funky visual selection gets copied and pasted: do you get a set of separate logical selections (if so, how delimited)? A single visually ordered selection (yuk)? Something else?

Otherwise, the weaker requirement for contiguous visual selection is likely to merely encourage the use of visual storage or the disposal of logical storage once the visual result has been generated. Which would lead to text copied from visually contiguous (logically discontiguous) selections being stored in visual order. Which is to be avoided.
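
To make the copy/paste question concrete, here is a toy Python sketch. It is not the Unicode bidirectional algorithm: it assumes a left-to-right paragraph and borrows the convention of writing right-to-left characters as uppercase letters. The function names are mine, not the specification's. It shows a contiguous logical selection landing on discontiguous visual positions, while the copied text stays in logical order.

```python
def visual_order(text):
    """Return logical indices in left-to-right display order.

    Toy model only: uppercase letters stand in for right-to-left
    characters, and each maximal uppercase run is displayed reversed.
    """
    order = []
    i = 0
    while i < len(text):
        j = i
        if text[i].isupper():
            while j < len(text) and text[j].isupper():
                j += 1
            order.extend(range(j - 1, i - 1, -1))  # RTL run: reversed
        else:
            while j < len(text) and not text[j].isupper():
                j += 1
            order.extend(range(i, j))  # LTR run: unchanged
        i = j
    return order


def visual_positions(text, start, end):
    """Screen positions covered by the logical selection [start, end)."""
    order = visual_order(text)
    return sorted(pos for pos, idx in enumerate(order) if start <= idx < end)


def is_contiguous(positions):
    """True if the positions form one unbroken run."""
    return positions == list(range(positions[0], positions[-1] + 1))


def copy_logical(text, start, end):
    """What copy should yield: the selected text in logical order."""
    return text[start:end]
```

With the stored text "abcDEFghi" (displayed as "abcFEDghi"), the logical selection [2, 5) covers screen positions 2, 4 and 5, so it is highlighted in two pieces, yet the copied string is simply the logical slice.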

It would be a good idea to tie into WAI concerns by noting that accessibility tools, which access the DOM, should be able to get at logically ordered text and to know which parts are selected.


3.5

"[S ] Specifications MUST be defined in terms of Unicode characters, not bytes or glyphs."

Yes in general, but not exclusively. The MUST should be retained, but the scope of the statement tightened up. Specifications *when they talk about characters* MUST be defined in terms of Unicode characters, not bytes or glyphs. Specifications are allowed to talk about bytes if that is what they are representing (eg, PNG, which defines a byte stream) or glyphs (for example the SVG glyph element, which is very clearly defined as a glyph because that is its purpose in life, although its 'unicode' attribute is, indeed, defined in terms of (a string of) Unicode characters).
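
A small Python illustration of why the character/byte distinction matters (the helper name is hypothetical, chosen for this comment only): the same four-character string has different lengths depending on whether one counts characters or bytes in a given encoding form.

```python
def sizes(text):
    """Count one string in characters and in bytes of two encoding forms."""
    return {
        "characters": len(text),  # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-le")),  # no BOM
    }


if __name__ == "__main__":
    print(sizes("caf\u00e9"))
```

A specification that defines, say, a maximum field length needs to state which of these counts it means.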

"[S] Specifications SHOULD allow the use of the full range of Unicode code points from U+0000 to U+10FFFF inclusive; code points above U+10FFFF MUST NOT be used."

So XML is not conforming, since it disallows, for example, U+0000?
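
For reference, a sketch of the XML 1.0 Char production as a Python predicate, plus a check against the parser in the Python standard library. The function names are mine. It shows that U+0000 is rejected even as a numeric character reference.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(cp):
    """True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(doc):
    """True if the standard-library parser accepts doc as well-formed."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_xml10_char(0x0), parses("<a>&#0;</a>"))
    print(is_xml10_char(0x41), parses("<a>&#65;</a>"))
```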


3.6.2

"[S] If the unique encoding approach is not chosen, specifications MUST designate at least one of the UTF-8 and UTF-16 encoding forms of Unicode as admissible encodings and SHOULD choose at least one of UTF-8 or UTF-16 as mandated encoding forms (encoding forms that MUST be supported by implementations of the specification)."

Does that mean that, for example, saying UTF-8 is allowed and UTF-16 is disallowed and an encoding declaration is not required, is okay?

Needs a little more on encodings that are really a group of similar but not identical encodings, for example Shift-JIS.
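
As a hedged illustration of the "family of encodings" problem, the sketch below decodes the same two bytes with two codecs shipped with Python, `shift_jis` and `cp932` (the Microsoft variant). The byte pair 0x81 0x60 is a commonly cited case where the variants map to different Unicode characters; the helper name is hypothetical.

```python
import unicodedata


def decode_variants(raw, codecs=("shift_jis", "cp932")):
    """Decode raw with each codec; None marks bytes a codec rejects."""
    results = {}
    for name in codecs:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


if __name__ == "__main__":
    for codec, text in decode_variants(b"\x81\x60").items():
        label = unicodedata.name(text, "<unnamed>") if text else "<undecodable>"
        print(codec, label)
```

A label such as "Shift-JIS" in a protocol header does not say which of these mappings the sender used, which is why the specification should say more here.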

"Because of the layered Web architecture (e.g. formats used over protocols), there may be multiple and at times conflicting information about character encoding. [S] Specifications MUST define conflict-resolution mechanisms (e.g. priorities) for cases where there is multiple or conflicting information about character encoding." 

Yes. Better though to not define such layering; the XML MIME RFC messed this up by allowing the charset and the xml encoding declaration to differ and for the former to take precedence; this requires "save as" to rewrite the XML otherwise it is no longer well formed.... better to require any transcoders to leave XML alone or to know how to rewrite the encoding declaration if they change the encoding.
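
A minimal sketch of what "know how to rewrite the encoding declaration" could look like, in Python. The function name and the regular expression are assumptions of mine, and the pattern only handles a declaration at the very start of the document; a real transcoder would need to be more careful.

```python
import re

# Matches the encoding pseudo-attribute inside a leading XML declaration.
_ENCODING_DECL = re.compile(
    r'^(<\?xml[^>]*?encoding=)(["\'])([A-Za-z][A-Za-z0-9._-]*)\2'
)


def transcode_xml(data, source, target):
    """Re-encode an XML document and keep its declaration truthful."""
    text = data.decode(source)
    text = _ENCODING_DECL.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{target}{m.group(2)}",
        text,
        count=1,
    )
    return text.encode(target)


if __name__ == "__main__":
    original = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'
    print(transcode_xml(original.encode("latin-1"), "iso-8859-1", "utf-8"))
```

The point is that the bytes and the declaration change together, so the document is still well formed after "save as" without consulting any external charset parameter.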

"Certain encodings are more or less associated with certain languages (e.g. Shift-JIS with Japanese); trying to support a given language or set of customers may mean that certain encodings have to be supported."

The corollary should be clearly stated: do not assume that "everyone" supports a favored but non-mandated encoding. "Every parser I know supports Latin-1/Shift-JIS" is not true.

3.6.3 contradictory 

"[S] Specifications SHOULD NOT provide mechanisms for agreement on private use code points between parties and MUST NOT require the use of such mechanisms. "

An SVG glyph with a unicode="&#xFE00;": is that a private agreement (and hence in contravention)? If you disallow it, though, you break the following

"[S] [I] Specifications and implementations SHOULD be designed in such a way as to not disallow the use of private use code points by private arrangement."

and in practice, disallowing it would merely encourage mapping glyphs to the ASCII code range, whereas they should use the correct Unicode code point or, if none, the PUA. Related point: avoid using character mechanisms for things that are not characters ("pi" fonts). Use small inline graphics instead.
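
For readers checking whether a given code point falls in the PUA at all, a short Python sketch (helper names are mine). It tests the three private use ranges directly and cross-checks against the general category `Co` reported by the standard `unicodedata` module.

```python
import unicodedata

# The BMP Private Use Area and the two supplementary private use planes.
_PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(cp):
    """True if code point cp lies in one of the private use ranges."""
    return any(lo <= cp <= hi for lo, hi in _PRIVATE_USE_RANGES)


def is_private_use_by_category(cp):
    """Same question, answered from the Unicode general category."""
    return unicodedata.category(chr(cp)) == "Co"


if __name__ == "__main__":
    for cp in (0x41, 0xE000, 0xF0000):
        print(f"U+{cp:04X}", is_private_use(cp), is_private_use_by_category(cp))
```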

3.7

"[S] Escaped characters SHOULD be acceptable wherever unescaped characters are; this does not preclude that a syntax-significant character, when escaped, loses its significance in the syntax. In particular, escaped characters SHOULD be acceptable in identifiers and comments."

XML should allow NCRs everywhere, for example inside element and attribute names?
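
To show what XML does today, a quick check with the parser in the Python standard library (the helper name is hypothetical): a numeric character reference is accepted in content but not inside an element name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc):
    """True if the standard-library XML parser accepts doc."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(is_well_formed("<a>&#x62;</a>"))  # NCR in character data
    print(is_well_formed("<a&#x62;/>"))     # NCR inside an element name
```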


8 Character Encoding in URI References

"[S] W3C specifications MUST define when the conversion from IRI references to URI references (or subsets thereof) takes place, in accordance with Internationalized Resource Identifiers (IRI) [I-D IRI]."

Why not go further and say that the IRI form is used in the document instance and the hexified URI form when it goes over the wire? It would be bad if different XML namespaces defined different processing here.
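
A rough Python sketch of the "hexified" form, under loud assumptions: it percent-encodes the UTF-8 bytes of every non-ASCII character and leaves ASCII (including existing `%` escapes) alone. It ignores the special handling the IRI draft gives to the host part, and the function name and example URL are mine.

```python
def iri_to_uri(iri):
    """Percent-encode the UTF-8 bytes of each non-ASCII character."""
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            pieces.append(ch)  # ASCII is passed through untouched
        else:
            pieces.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(pieces)


if __name__ == "__main__":
    print(iri_to_uri("http://example.org/r\u00e9sum\u00e9"))
```

If every namespace performed this one conversion at the same point (on the way to the wire), the document instance could always carry the readable IRI form.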

9 Referencing the Unicode Standard and ISO/IEC 10646

"Conformance to Unicode implies conformance to ISO/IEC 10646, see [Unicode 3.0] Appendix C.

[S] Since specifications in general need both a definition for their characters and the semantics associated with these characters, specifications SHOULD include a reference to the Unicode Standard, whether or not they include a reference to ISO/IEC 10646. By providing a reference to The Unicode Standard implementers can benefit from the wealth of information provided in the standard and on the Unicode Consortium Web site."

That is a bit weak. Say explicitly that a reference to 10646 without a reference to Unicode implies no character semantics, no bidi processing, no character case information, etc. Also, since one is a strict superset of the other, provide a rationale for why a specification should ever provide a reference to 10646, given that a reference to Unicode covers exactly the same CCS.
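
The kind of information at stake can be seen with the `unicodedata` module in the Python standard library, which exposes properties from the Unicode Character Database. This is only an illustration; the helper name is mine.

```python
import unicodedata


def describe(ch):
    """Collect a few Unicode properties that a bare CCS does not carry."""
    return {
        "name": unicodedata.name(ch, "<unnamed>"),
        "category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "uppercase": ch.upper(),
    }


if __name__ == "__main__":
    for ch in ("a", "\u00df", "\u05d0"):
        print(f"U+{ord(ch):04X}", describe(ch))
```

The bidi class and the case mapping are exactly the semantics that a reference to the coded character set alone would leave undefined.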