Re: use of character entities (was: Re: Joint meeting at TPAC from HTML and i18n core WG minutes 2007-11-09) from Henri Sivonen on 2007-11-22 (public-html@w3.org from November 2007)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 22 Nov 2007 15:17:58 +0200
To: Richard Ishida <ishida@w3.org>
Cc: "'Martin Duerst'" <duerst@it.aoyama.ac.jp>, <public-html@w3.org>, <public-i18n-core@w3.org>
Message-Id: <195CC333-44C0-4933-9467-7E5FBC19F422@iki.fi>

On Nov 20, 2007, at 21:01, Richard Ishida wrote:

>> On Nov 20, 2007, at 08:12, Martin Duerst wrote:
>>
>>>> Validator checking entity reqs
>>>>
>>>> Henri: I don't check that character entities are only used for
>>>> characters that are unclear.
>>>> ... because I can't tell mechanically whether the character is
>>>> unclear
>>>
>>> I think you could tell mechanically if you had a list of these.
>>
>> Yes, but there is no objective list at this time in the spec
>> or normatively referenced by the spec. (And even if there
>> were, I'm not convinced that checking would be a good idea.)
>
> The spec = charmod or html5 ?

I meant Charmod.

>>> and second, if the message was aggregated
>>> ("Warning: 200 unnecessary character entities detected, you may want
>>> to change them to actual characters (e.g. &#xABCD; -> @@).").
>>
>> If you are the author, perhaps you had a reason to use
>> escapes--such as an input method that is limited or wanted
>> CMS source code to be all ASCII in order to avoid having to
>> deal with non-ASCII program code issues in version control.
>
> It's not only about technical issues.  It's also about readability of the
> source text.  It's quite hard to maintain text that looks like this when
> you're looking at the source:
>
> Jako efektivn&#x115;j&#x161;&#xED; se n&#xE1;m jev&#xED;
> po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m
> na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na
> Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu
> z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.

I don't think it is the job of conformance definitions to impose the  
spec writers' notion of maintainability on authors.

> The key problem for a validator in this case is how to decide whether the
> content developer is doing it for a good reason. I don't think there's a way
> to do that.

I agree. When a requirement is not machine-checkable, the options are
either not trying to check for the requirement at all, or detecting the
possible violation and using a warning to tell the user to check.
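
For illustration only, the second option (detect, then emit one aggregated
warning) could be sketched roughly as below. This is not code from any
actual validator. The function names, the message wording, and the choice
to ignore escapes for ASCII code points are all assumptions made for the
sketch.

```python
import re
from typing import Optional

# Matches numeric character references such as &#x159; or &#345;.
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def count_non_ascii_escapes(source: str) -> int:
    """Count numeric references that encode a non-ASCII code point."""
    count = 0
    for match in NUMERIC_REF.finditer(source):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Assumption: escapes for ASCII (e.g. &#x3C;) are often needed for
        # syntax reasons, so they are not counted.
        if code_point > 0x7F:
            count += 1
    return count


def escape_warning(source: str) -> Optional[str]:
    """Return a single aggregated warning, or None if nothing was found."""
    count = count_non_ascii_escapes(source)
    if count == 0:
        return None
    # Hypothetical wording; the tool cannot know whether the author had a
    # good reason, so it only asks the author to check.
    return (
        f"Warning: {count} numeric character references detected; "
        "check whether they can be replaced by the actual characters."
    )
```

The sketch can only count escapes. It cannot tell whether the author had a
good reason for them, which is why the output is a request to check and not
an error.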

> I don't see why, however, in the interests of being helpful to
> a content author the validator can't produce one message that advises them
> that there are x number of escapes detected, that it is generally best to
> avoid those unless there is a good reason for them being there, and that
> they should check whether they can't eliminate some.

That's the approach I took with PUA characters, since they actually
are bad for interop on the public network if used for something that
really should be done without the PUA. However, using escaped
characters is so common, interoperable and harmless that I think
bothering the user would only devalue validation messages.
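
As a rough sketch of what a PUA check of that kind might look like (again
not the actual validator code), one could flag every character whose
Unicode general category is "Co" (private use) and report them in a single
warning. The function names and the message text are hypothetical.

```python
import unicodedata
from typing import List, Optional


def find_private_use(text: str) -> List[int]:
    """Return the indexes of characters in a Private Use Area."""
    # General category "Co" covers U+E000..U+F8FF and the two
    # supplementary private use planes.
    return [i for i, ch in enumerate(text) if unicodedata.category(ch) == "Co"]


def pua_warning(text: str) -> Optional[str]:
    """Return one aggregated warning, or None if no PUA character is used."""
    positions = find_private_use(text)
    if not positions:
        return None
    # Hypothetical wording: a warning, since the use may be intentional.
    return (
        f"Warning: {len(positions)} Private Use Area characters found; "
        "check that they are really needed."
    )
```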

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 22 November 2007 13:18:30 UTC