Re: Why is the accented A character preferred over the equivalent character entity reference? from Martin J. Dürst on 2013-02-28 (www-international@w3.org from January to March 2013)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Thu, 28 Feb 2013 12:01:15 +0900
To: "Costello, Roger L." <costello@mitre.org>
CC: "www-international@w3.org" <www-international@w3.org>
Message-ID: <512EC87B.6010206@it.aoyama.ac.jp>

Hello Roger,

On 2013/02/28 8:03, Costello, Roger L. wrote:
> Hi Folks,
>
> In the document "Character Model for the World Wide Web 1.0: Normalization" it says this at the bottom of section 3.3.3:
>
>      With appropriate entity definitions, instead of A&acute;,
>      write &Aacute;

Just while we are at it, this is because &Aacute; will be in NFC when 
the entity reference is resolved, but A&acute; will not be in NFC.
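
A small check of this, as a sketch only: it assumes the "appropriate 
entity definitions" make &acute; resolve to the combining acute accent 
U+0301 (HTML's own predefined &acute; is the spacing accent U+00B4, so 
no HTML entity table is used here). The name is_nfc is made up for the 
illustration.

```python
import unicodedata

# Assumption: &Aacute; resolves to U+00C1, and &acute; is defined as the
# combining acute accent U+0301 (not HTML's spacing accent U+00B4).
precomposed = "\u00C1"    # what &Aacute; resolves to
decomposed = "A\u0301"    # what A&acute; resolves to under that assumption


def is_nfc(text):
    """Return True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text


print(is_nfc(precomposed))   # True
print(is_nfc(decomposed))    # False
# Normalizing the two-character sequence yields the single character.
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```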

> (or better, use 'Á' directly).
>
> The statement in parenthesis is particularly intriguing. Is it suggesting that Best Practice is to write this:
>
>  <Name>Ándre</Name>
>
> rather than this:
>
>  <Name>&xC1;ndre</Name>
>
> where &xC1; is the character entity reference for Á.
>
> Why is the former preferred over the latter?

In HTML and XML (and many other formats), escapes such as character 
entity references are what their name says, escape hatches. That means 
that you should only use them in "emergency situations". In the example 
at hand, most people, starting with the bearer(s) of that name, will be 
able to read Ándre without problems. But &xC1;ndre requires table lookup 
in Unicode or some other mental gymnastics.
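
To a parser the two forms are the same thing; the difference is only 
for the human reading the source. A minimal sketch, assuming the escape 
meant above is the numeric character reference &#xC1; and using 
Python's standard xml.etree.ElementTree as the parser:

```python
import xml.etree.ElementTree as ET

# Assumption: the escaped form is written as the numeric character
# reference &#xC1; (hexadecimal C1 is the code point of 'Á').
direct = ET.fromstring("<Name>Ándre</Name>").text
escaped = ET.fromstring("<Name>&#xC1;ndre</Name>").text

# Both documents carry the same character data once parsed.
print(direct == escaped)   # True
print(escaped)             # Ándre
```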

The preference for using characters directly, rather than escapes, is 
formally laid down at http://www.w3.org/TR/charmod/#C047. This is in 
"Character Model for the World Wide Web 1.0: Fundamentals", which, in 
contrast to the Normalization part you cited, is a W3C Recommendation. 
C047 says:

 >>>>>>>>
C047  [I]  [C]  Escapes SHOULD only be used when the characters to be 
expressed are not directly representable in the format or the character 
encoding of the document, or when the visual representation of the 
character is unclear.
 >>>>>>>>

The [I] says that this applies to implementers, the [C] says that this 
applies to content. The "are not directly representable" would apply if 
e.g. your document is encoded in Shift_JIS (which doesn't have 'Á'). The 
"the visual representation of the character is unclear" applies e.g. to 
&nbsp;, because when looking at the source it may be desirable to see 
that there's a non-breaking space there rather than a plain space. It may 
also apply if you don't have an editor that can show that character, if 
you e.g. can't input it, or if you are not familiar enough with the 
character/script to make sure you get the right one. But the former two 
are rare these days, and the latter is better avoided, because the 
person inputting/checking may have the same problem when looking at a 
Unicode table.
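
The "not directly representable" case can be tried out as follows. This 
is only a sketch using Python's built-in "shift_jis" codec and its 
"xmlcharrefreplace" error handler; the two function names are made up 
for the illustration.

```python
def directly_representable(text, encoding):
    """Return True if every character of text can be encoded as-is."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def encode_with_escapes(text, encoding):
    """Encode text, falling back to numeric character references
    only for the characters the encoding cannot represent."""
    return text.encode(encoding, errors="xmlcharrefreplace")


print(directly_representable("Ándre", "utf-8"))       # True
print(directly_representable("Ándre", "shift_jis"))   # False
print(encode_with_escapes("Ándre", "shift_jis"))      # b'&#193;ndre'
```

Only the 'Á' is escaped (as a decimal reference); the remaining 
characters stay as they are, which matches the "only when needed" 
wording of C047.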

Regards,   Martin.

Received on Thursday, 28 February 2013 03:01:55 UTC