
RE: use of character entities (was: Re: Joint meeting at TPAC from HTML and i18n core WG minutes 2007-11-09)

From: Richard Ishida <ishida@w3.org>
Date: Tue, 20 Nov 2007 19:01:47 -0000
To: "'Henri Sivonen'" <hsivonen@iki.fi>, "'Martin Duerst'" <duerst@it.aoyama.ac.jp>
Cc: <public-html@w3.org>, <public-i18n-core@w3.org>
Message-ID: <01a601c82ba7$cd2117d0$6401a8c0@rishida>

> -----Original Message-----
> From: public-i18n-core-request@w3.org 
> [mailto:public-i18n-core-request@w3.org] On Behalf Of Henri Sivonen
> Sent: 20 November 2007 08:35
...
> On Nov 20, 2007, at 08:12, Martin Duerst wrote:
> 
> >> Validator checking entity reqs
> >>
> >>  Henri: I don't check that character entities are only used for  
> >> characters that are unclear.
> >>  ... because I can't tell mechanically whether the character is  
> >> unclear
> >
> > I think you could tell mechanically if you had a list of these.
> 
> Yes, but there is no objective list at this time in the spec 
> or normatively referenced by the spec. (And even if there 
> were, I'm not convinced that checking would be a good idea.)

The spec = charmod or html5 ?


> > The world may not collapse if you happen to occasionally ignore a 
> > SHOULD. But then, that's why it's a SHOULD, not a MUST.
> 
> I think the bar for making conformance requirements against 
> technically unnecessary but technically harmless things 
> should be very high. Escaped characters produce exactly as 
> good a DOM as unescaped characters, so they are technically 
> harmless except for the extra bytes transferred over the 
> network. Making something like this a SHOULD devalues SHOULDs.
> 
> > I think that on this issue, Bjoern Hoermann once threatened to create
> > something like a validator that would produce an error message for
> > each and every 'clear' character encoded as an entity.
> >
> > This would of course be very bad usability design.
> 
> Indeed.
> 
> > For users, it would first be much better if this produced a warning,
> 
> Suggesting warnings instead of errors is a typical way to cop 
> out of considering which spec requirements really need to be 
> requirements.  
> Emitting a warning here would still devalue validator 
> *messages* in general and would produce as much output for 
> the user to read.  
> Changing errors to warnings doesn't improve usability. It 
> potentially makes it worse since it means the user needs to 
> think more.
> 
> > not an error (after all, it's just a SHOULD),
> 
> I disagree that SHOULD equals warning. SHOULDs are technical 
> requirements and violations of technical requirements are 
> errors. If a spec author wanted merely to document an 
> aesthetic convention, SHOULD is inappropriate.
> 
> I think it is OK to use warnings when:
>   1) The author is doing something that actually might cause 
> technical harm and the validator developer would have wanted 
> to emit an error but couldn't find spec text to back it up.
> OR
>   2) The situation genuinely requires human inspection to 
> determine whether there is actual technical harm.
> 
> > and second, if the message was aggregated
> > ("Warning: 200 unnecessary character entities detected, you may want
> > to change them to actual characters (e.g. &#xABCD; -> @@).").
> 
> If you are the author, perhaps you had a reason to use 
> escapes--such as a limited input method, or wanting CMS 
> source code to be all ASCII in order to avoid having to 
> deal with non-ASCII program code issues in version control.

It's not only about technical issues.  It's also about readability of the
source text.  It's quite hard to maintain text that looks like this when
you're looking at the source:

Jako efektivn&#x11B;j&#x161;&#xED; se n&#xE1;m jev&#xED;
po&#x159;&#xE1;d&#xE1;n&#xED; tzv. Road Show prost&#x159;ednictv&#xED;m
na&#x161;ich autorizovan&#xFD;ch dealer&#x16F; v &#x10C;ech&#xE1;ch a na
Morav&#x11B;, kter&#xE9; prob&#x11B;hnou v pr&#x16F;b&#x11B;hu
z&#xE1;&#x159;&#xED; a &#x159;&#xED;jna.
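
For comparison, here is a minimal Python sketch of what a reader has to do
mentally when faced with source like that. It assumes nothing beyond the
standard library's html.unescape; the sample string and the function name
decode_escapes are only illustrative.

```python
import html

# Illustrative sample: a short fragment in the style of the escaped text above.
escaped = "Morav&#x11B;, kter&#xE9; prob&#x11B;hnou"


def decode_escapes(text: str) -> str:
    """Replace numeric and named character references with their characters."""
    return html.unescape(text)


if __name__ == "__main__":
    # The decoded form is what the author would see with unescaped source.
    print(decode_escapes(escaped))
```

The decoded string is far easier to proofread than the escaped one, which is
the readability point here.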

I agree that there can be a number of good reasons for including escapes,
but the SHOULD says that as long as you are *doing it for a good reason*
that is ok. The onus is on the producer of the content to be satisfied that
they have done the right thing. Perhaps the problem here is that not all the
Charmod conformance criteria can be mapped automatically to the type of
criteria currently used in the validator.

The key problem for a validator in this case is how to decide whether the
content developer is doing it for a good reason. I don't think there's a way
to do that.  

I think we agree that the validator can't call this an error.  I also agree
with Martin that calling out every use of an escape with a separate warning
is overkill.  I don't see why, however, in the interests of being helpful to
a content author the validator can't produce one message that advises them
that there are x number of escapes detected, that it is generally best to
avoid those unless there is a good reason for them being there, and that
they should check whether they can't eliminate some. (Note that this is
different from what Martin suggested in a couple of points:

1. no mention of 'unnecessary', since that's hard to tell
2. character entities -> escapes (character entities are things like
&aacute;))

Whether you want to classify this as a warning or not may be a separate
question.
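
To make the single-message idea concrete, here is a rough Python sketch. It is
not how any existing validator works; the names count_escapes and advisory,
the message wording, and the choice to leave out the five markup-significant
entity names (amp, lt, gt, quot, apos) are all my own assumptions. It makes no
attempt to judge whether an escape is 'unnecessary'.

```python
import re
from typing import Optional

# Numeric character references such as &#1234; or &#xABCD;
NUMERIC_ESCAPE = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")
# Named character entity references such as &aacute;
NAMED_ENTITY = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")
# Assumption: these names usually escape markup syntax, so they are not counted.
SYNTAX_NAMES = {"amp", "lt", "gt", "quot", "apos"}


def count_escapes(source: str) -> int:
    """Count numeric escapes plus named entities other than the syntax ones."""
    numeric = len(NUMERIC_ESCAPE.findall(source))
    named = sum(
        1 for name in NAMED_ENTITY.findall(source) if name not in SYNTAX_NAMES
    )
    return numeric + named


def advisory(source: str) -> Optional[str]:
    """Return one aggregated message, or None when no escapes are found."""
    n = count_escapes(source)
    if n == 0:
        return None
    return (
        f"{n} escape(s) detected. It is generally best to avoid escapes "
        "unless there is a good reason for them; check whether some can "
        "be eliminated."
    )
```

Calling advisory() on a whole document yields at most one message however many
escapes it contains, and leaves the decision about a 'good reason' to the
author.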
 
Cheers,
RI


...

> The way I see it is that a validator should check that a 
> document meets requirements placed on content [C]. However, 
> doing so for the C047 and C048 requirements would devalue 
> validation messages.
> 
> -- 
> Henri Sivonen
> hsivonen@iki.fi
> http://hsivonen.iki.fi/
> 
> 
> 
Received on Tuesday, 20 November 2007 18:59:10 UTC
