Re: Hex NCRs in generated XML: nice, but hardly essential.

[short summary of the main response points from a private discussion]

At 01:17 06/03/28, Chris Lilley wrote:
 >
 >Hello www-i18n-comments,
 >
 >In Character Model for the World Wide Web 1.0: Fundamentals
 >
 >we read:
 >
 >http://www.w3.org/TR/2005/REC-charmod-20050215/#C043
 >
 >  C043 [S] The number of different ways to escape a character SHOULD be
 >  minimized (ideally to one).
 >
 >  A well-known counter-example is that for historical reasons, both HTML
 >  and XML have redundant decimal (&#ddddd;) and hexadecimal (&#xhhhh;)
 >  character escapes.
 >
 >Yes. Given that XML does, as noted, have both of them, we find that
 >
 >http://www.w3.org/TR/2005/REC-charmod-20050215/#C048
 >
 >  C048 [I] [C] Content SHOULD use the hexadecimal form of character
 >  escapes rather than the decimal form when there are both.
 >
 >  NOTE: The hexadecimal form is preferred because character encoding
 >  standards (in particular Unicode) usually list character numbers as
 >  hexadecimal, making lookup easier.
 >
 >to be overly strong.

Why? There are only MUST, SHOULD, and MAY, and a MAY wouldn't make
sense in this case.
[When working on the document, we often used terms like "strong SHOULD"
and "weak SHOULD", but those don't actually exist, and so the distinction
is left to the common sense of the reader (basing his judgement on the
whole of the document rather than on a single sentence).]

 >It's certainly sound advice for hand authors, and a
 >content creation tool might well be coded up to choose hex rather than
 >decimal escapes, since it makes no particular difference which to use.

Yes indeed. That's just about what we had in mind when we wrote
that part of the spec.

 >Requiring all content to use hex NCRs, though, seems rather strong.

A SHOULD does not require anything. The equivalent of REQUIRED in
RFC 2119 terms is MUST, and the equivalent of SHOULD is RECOMMENDED.

 >Saying that software which emits XML does not conform because it allows
 >decimal NCRs to be generated is also overly strong - fair enough for
 >NCRs that are machine generated, but if the author put them in then
 >software has no real business changing them.

That may depend on the software. Anything working on the Infoset level
will just forget whether the original was an NCR (hexadecimal or decimal)
or the actual character.

But keeping things the way a human author wrote them is certainly a good
reason for not observing the SHOULD in C048 in editor-like tools.
And that's exactly what a SHOULD is all about: Something you do
except if you have a good reason not to do it.
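
To illustrate the Infoset point above, here is a minimal sketch, assuming
Python's standard xml.etree.ElementTree as a stand-in for "software working
on the Infoset level" (the sample element and text are made up for
illustration). Once parsed, the decimal NCR, the hexadecimal NCR, and the
literal character are indistinguishable:

```python
import xml.etree.ElementTree as ET

# Three spellings of the same content: decimal NCR, hexadecimal NCR,
# and the literal character U+00E9.
decimal = ET.fromstring("<p>caf&#233;</p>")
hexadecimal = ET.fromstring("<p>caf&#xE9;</p>")
literal = ET.fromstring("<p>caf\u00e9</p>")

# After parsing, only the character itself remains; the parser keeps
# no record of how the author originally wrote it.
same_text = decimal.text == hexadecimal.text == literal.text
print(same_text)

# Re-serializing therefore uses whatever form the serializer chooses,
# regardless of what the input used.
print(ET.tostring(hexadecimal, encoding="unicode"))
```

A tool like this cannot preserve the author's choice even if it wanted to,
whereas an editor working on the raw text can.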

 >It slightly increases readability (though not as much as using the actual
 >character does),

Yes, that's what C047 is about. Definitely use the real character
if you can.
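
For generated content, C047 and C048 together could be sketched roughly as
follows. This is only an illustration: escape_for_encoding is a hypothetical
helper name, not part of any spec or library, and it deliberately ignores
markup escaping of characters such as & and <.

```python
def escape_for_encoding(text: str, encoding: str) -> str:
    """Hypothetical helper: keep every character the target encoding can
    represent (C047) and write the rest as hexadecimal NCRs (C048)."""
    out = []
    for ch in text:
        try:
            ch.encode(encoding)
            out.append(ch)  # representable: use the real character
        except UnicodeEncodeError:
            out.append(f"&#x{ord(ch):X};")  # not representable: hex NCR
    return "".join(out)


# U+20AC (euro sign) is not in ISO-8859-1, but U+00E9 is.
print(escape_for_encoding("caf\u00e9 \u20ac", "iso-8859-1"))
```

With UTF-8 as the target encoding, nothing needs escaping at all.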

 >but so does a two-character indent or other forms of
 >pretty printing.

Yes, but these are obviously not topics of the Character Model.


Regards,    Martin. 

Received on Tuesday, 28 March 2006 12:38:34 UTC