- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Thu, 28 Feb 2013 12:01:15 +0900
- To: "Costello, Roger L." <costello@mitre.org>
- CC: "www-international@w3.org" <www-international@w3.org>
Hello Roger,

On 2013/02/28 8:03, Costello, Roger L. wrote:
> Hi Folks,
>
> In the document "Character Model for the World Wide Web 1.0: Normalization" it says this at the bottom of section 3.3.3:
>
>     With appropriate entity definitions, instead of A´, write Á

Just while we are at it, this is because Á will be in NFC when the entity reference is resolved, but A´ will not be in NFC.

>     (or better, use 'Á' directly).
>
> The statement in parentheses is particularly intriguing. Is it suggesting that Best Practice is to write this:
>
>     <Name>Ándre</Name>
>
> rather than this:
>
>     <Name>&#xC1;ndre</Name>
>
> where &#xC1; is the character entity reference for Á.
>
> Why is the former preferred over the latter?

In HTML and XML (and many other formats), escapes such as character entity references are what their name says: escape hatches. That means that you should only use them in "emergency situations". In the example at hand, most people, starting with the bearer(s) of that name, will be able to read Ándre without problems. But &#xC1;ndre requires a table lookup in Unicode or some other mental gymnastics.

The preference for using characters directly, rather than escapes, is formally put down at http://www.w3.org/TR/charmod/#C047. This is in "Character Model for the World Wide Web 1.0: Fundamentals", which, in contrast to the Normalization part you cited, is a W3C Recommendation. C047 says:

>>>>>>>>
C047 [I] [C] Escapes SHOULD only be used when the characters to be expressed are not directly representable in the format or the character encoding of the document, or when the visual representation of the character is unclear.
>>>>>>>>

The [I] says that this applies to implementers, the [C] says that this applies to content.

The "are not directly representable" part would apply if, e.g., your document is encoded in Shift_JIS (which doesn't have 'Á').

The "the visual representation of the character is unclear" part applies, e.g., to &nbsp;, because it may be desirable to see, when looking at the source, that there's a non-breaking space there rather than a plain space. It may also apply if you don't have an editor that can show the character, if you can't input it, or if you are not familiar enough with the character/script to make sure you get the right one. But the first two cases are rare these days, and the last is better avoided, because the person inputting/checking may have the same problem when looking at a Unicode table.

Regards, Martin.
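P.S. For readers who want to check the two technical points above (NFC, and "not directly representable"), here is a minimal Python sketch. It assumes the non-NFC form is 'A' followed by U+0301 COMBINING ACUTE ACCENT; the helper names is_nfc and escape_if_needed are made up for illustration and are not part of any specification.

```python
import html
import unicodedata


def is_nfc(text: str) -> bool:
    """Return True if text is already in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text) == text


def escape_if_needed(text: str, encoding: str) -> bytes:
    """Encode text directly where the encoding allows it, and fall back to
    numeric character references only for characters the encoding cannot
    represent (the "not directly representable" case of C047)."""
    return text.encode(encoding, errors="xmlcharrefreplace")


# Assumption: the decomposed form is 'A' + U+0301 COMBINING ACUTE ACCENT.
decomposed = "A\u0301ndre"
precomposed = "\u00C1ndre"  # 'Á' as a single code point

print(is_nfc(decomposed))    # False
print(is_nfc(precomposed))   # True

# Resolving the numeric character reference yields the precomposed form.
print(html.unescape("&#xC1;ndre") == precomposed)  # True

# UTF-8 can carry 'Á' directly, so no escape is needed.
print(escape_if_needed(precomposed, "utf-8"))      # b'\xc3\x81ndre'

# Shift_JIS has no 'Á', so the escape hatch is used.
print(escape_if_needed(precomposed, "shift_jis"))  # b'&#193;ndre'
```

The point of the second helper is only that the escape is chosen per character and per encoding, as a fallback, rather than written by hand everywhere.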
Received on Thursday, 28 February 2013 03:01:55 UTC