Re: For review: Character encodings in HTML and CSS from Leif Halvard Silli on 2010-02-10 (www-international@w3.org from January to March 2010)

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Wed, 10 Feb 2010 07:09:09 +0100
To: Richard Ishida <ishida@w3.org>
Cc: www-international@w3.org
Message-ID: <20100210070909782876.c813d58a@xn--mlform-iua.no>
Richard Ishida, Tue, 9 Feb 2010 13:20:29 -0000:
> Comments are being sought on this article prior to final release. 
> Please send any comments to this list (www-international@w3.org). We 
> expect to publish a final version in one to two weeks.
> 
> See http://www.w3.org/International/tutorials/tutorial-char-enc/temp


> The rearrangement was to downplay slightly the XHTML 1.0 issues, 
> given that that is now only relevant to IE6,

> The update adds information about HTML5.

Here are the additional things that I would have liked to know when 
reading such a document ...

(1) It should be mentioned that in SGML-based mark-up, such as HTML4,
one may omit the ";" in NCRs. All the big 6 desktop browsers (IE,
Firefox, Opera, Webkit, Konqueror, Chrome [assuming it is like
Webkit]) support this _inside attributes_. (I have a quite thorough
test document here: <http://målform.no/ncr-test/> ) They also all
support it in text, with one exception: for NCRs directly in text, IE
requires the semicolon for hex NCRs but not for decimal NCRs. [IE got
support for hex NCRs later on, didn't it? Must be a bug ... !] So one
could give the usage advice that it is "better" and simpler to use the
semicolon than to omit it, but still say that it is permitted to drop
it. (My view is that it should be permitted in HTML5 too.) Another
part of the advice could be that omitting it is safer - and more
justified - inside machine-readable attributes than inside
human-readable text.
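
To show why the semicolon is the simpler advice, here is a minimal
sketch in Python rather than in a browser. It assumes that Python's
html.unescape(), which follows the HTML5 rules for character
references, is a fair stand-in for a lenient parser; the word "blåbær"
is just an illustration and not taken from the test document.

```python
import html

# Decimal NCR without ";": the number ends at the first non-digit.
no_semicolon = html.unescape("m&#229l")            # "mål"

# Hex NCR without ";" followed by a letter that is also a hex digit:
# the "b" of "blåbær" is swallowed into the number (U+0E5B, not U+00E5).
swallowed = html.unescape("bl&#xe5b&#xe6r")

# With semicolons there is nothing to misread.
with_semicolons = html.unescape("bl&#xe5;b&#xe6;r")  # "blåbær"

print(no_semicolon, swallowed, with_semicolons)
```

The middle case is the one authors are likely to trip over: dropping
the semicolon only works when the next character cannot be read as
part of the number.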

(2) The document appears thin when it comes to CSS escapes. 

  * The explanation of what a CSS escape is, is now located under the
heading "What are entities and NCRs?"
<http://www.w3.org/International/tutorials/tutorial-char-enc/temp#what>.
I think a separate heading for CSS escapes would be better. Or,
alternatively, the existing heading could be changed to say "What
are entities, NCRs and CSS escapes?".
  * There should also be a CSS escape example, the same way that there
already are yellow-colored examples of NCRs and entities.
  * (One of the) CSS examples could e.g. show what it means in practice
that the space character terminates the CSS escape, as this can be
highly confusing for authors. This can best be shown by having a CSS
selector which contains only escaped letters, or a selector consisting
of 3 letters with the escaped one in the middle:

.mål{} 

becomes (note the space)

.m\0000e5 l{} 
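
The terminating space can also be demonstrated outside a browser. The
following is a simplified sketch, assuming only the CSS 2.1 rule that
an escape is a backslash, 1 to 6 hex digits and one optional
whitespace character that is consumed along with it. The function
names are made up for this example, and it does not handle other
kinds of escapes or code points above U+10FFFF.

```python
import re

# Backslash, 1-6 hex digits, then one optional whitespace character
# that belongs to the escape and disappears when it is decoded.
_CSS_ESCAPE = re.compile(r"\\([0-9a-fA-F]{1,6})(?:\r\n|[ \t\r\n\f])?")


def css_unescape(text):
    """Decode hex escapes in a CSS identifier (simplified)."""
    return _CSS_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def css_escape_non_ascii(text):
    """Write every non-ASCII character as a 6-digit escape plus a space."""
    return "".join(
        ch if ord(ch) < 128 else "\\{:06x} ".format(ord(ch)) for ch in text
    )


print(css_escape_non_ascii(".mål{}"))   # .m\0000e5 l{}
print(css_unescape(r".m\e5 l{}"))       # .mål{}
print(css_unescape(r"bl\e5 b\e6r"))     # blåbær
```

Without the space, r"bl\e5b\e6r" decodes differently, because the "b"
is read as a third hex digit of the first escape.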

(3) Specification of the encoding of an external CSS file: The text 
currently says that 

    ]]If your external CSS style sheet contains any non-ASCII text [ 
snip ] you should use the @charset rule as the first thing on the page. 
(It should not be used for CSS embedded in a document.)[[

    However, I think many authors are not aware that they may use HTTP
to signal the charset of CSS files as well. Therefore I think you
should mention this. (You already mention another alternative in that
context, namely the BOM. The BOM has support issues, you say, but
HTTP works very well, AFAIK.)
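
What the HTTP alternative amounts to is a charset parameter on the
Content-Type header of the style sheet response, for instance
"Content-Type: text/css; charset=utf-8". As a rough sketch of how a
consumer could read that parameter, assuming Python's standard
email.message module is an acceptable header parser here (the
function name is invented for the example):

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lower-cases the value it finds.
    return msg.get_content_charset()


print(charset_from_content_type("text/css; charset=UTF-8"))  # utf-8
print(charset_from_content_type("text/css"))                 # None
```

When the second case occurs, the consumer has to fall back on
@charset, the BOM or the encoding of the referring document.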

(4) The logic of using escapes in @style and <style> and stylesheets:
  * I believe many web authors think they /have/ to use escapes e.g. in
CSS selectors. So I think that the document should say that they don't
have to - they can often type the characters directly - especially if
CSS and HTML are located in the same document ...

(5) I believe that many authors are not aware that they may use
character escapes inside (many) HTML attributes. Hence I think a word
should be said about the fact that this is possible. (You talk about
the style attribute, but @style is - or may appear to be - a special
case.)

(6) You say that it is better to use CSS escapes inside the @style 
attribute. And the reason you give is related to the possible need for 
moving the escapes to the <style> element, or perhaps even to an 
external (CSS) file. In the same spirit, you should mention that one 
reason for using NCRs and entities can be that one wants to be able to 
present the same file in different encodings - without actually 
re-encoding the file first. You could perhaps add this inside or near 
the paragraph about "Encoding gaps".
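
A small sketch of that argument, assuming Python's built-in
"xmlcharrefreplace" error handler as the way of producing the NCRs:
once every non-ASCII character is escaped, the bytes are pure ASCII
and read the same under any ASCII-compatible encoding label.

```python
text = "målform"

# Replace everything outside ASCII with a decimal NCR.
escaped = text.encode("ascii", errors="xmlcharrefreplace")
print(escaped)  # b'm&#229;lform'

# The same bytes decode identically under several encoding labels.
decoded = {
    enc: escaped.decode(enc)
    for enc in ("utf-8", "iso-8859-1", "windows-1252")
}
print(decoded)
```

The three encodings are only examples; the point does not hold for
encodings that are not ASCII-compatible, such as UTF-16.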

(7) Length of escapes: Something should be said about whether there
are length limits/requirements for NCRs and CSS escapes:
    * CSS2.1 limits the length to (I believe) 6 hexadecimal digits
after the '\' and before the space character. No browser accepts CSS
escapes that are longer than the limit either.
    * For HTML, there are no specified limits. But in practice:
Opera, Lynx and Firefox appear to accept endless escapes (such as
&#0000000000000229;) whereas Webkit has a limit that looks to be 8
characters, including zeros, and regardless of hex or dec. IE,
meanwhile, seems to have the exact same limit as in CSS (6 characters
for hex NCRs - which is like the length limit in CSS escapes - and 7
characters for dec NCRs [to be able to write the hex values with dec
numbers, I suppose]). See again my test case:
<http://målform.no/ncr-test/> - which tests only the letter 'ü' in
different NCR "encodings".
 Thus, the advice could perhaps be to follow the CSS rules about the
length of the escape: not longer than 6 characters. (Making them longer
can be useful for targeting particular browsers though ...)
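
The 6 and 7 character figures fit the size of Unicode itself, which
the following sketch checks. It also builds zero-padded NCRs of the
kind used in the test case; html.unescape() is assumed here as an
example of a parser with no length limit, and padded_ncrs() is a
hypothetical helper, not something from the test document.

```python
import html

MAX_CODE_POINT = 0x10FFFF  # highest Unicode code point

hex_digits = len(format(MAX_CODE_POINT, "x"))  # 6
dec_digits = len(str(MAX_CODE_POINT))          # 7


def padded_ncrs(char, width):
    """Return (decimal, hex) NCRs for char, zero-padded to width digits."""
    cp = ord(char)
    return (
        "&#{:0{w}d};".format(cp, w=width),
        "&#x{:0{w}x};".format(cp, w=width),
    )


dec_ncr, hex_ncr = padded_ncrs("ü", 16)
print(dec_ncr, hex_ncr)
print(html.unescape(dec_ncr), html.unescape(hex_ncr))  # ü ü
```

So 6 hex digits or 7 decimal digits are always enough, and anything
longer is only padding.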

(8) You say that &apos; is not defined in HTML. However, it is defined
in the HTML5 language specification draft. Thus, the advice not to use
it because it is not defined in HTML appears to be solely
specification-compatibility advice. It would perhaps be more relevant
to point, if anything, to the lack of user agent support (IE = no
support, Webkit = support).
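
For authors who want to avoid the question altogether, a numeric
reference for the apostrophe works regardless of whether &apos; is
known. A small sketch, assuming Python's html module as the tool (it
follows HTML5, so it does know &apos;):

```python
import html

# An HTML5-aware parser accepts the named reference ...
apos_named = html.unescape("&apos;")

# ... while the numeric forms do not depend on any entity list.
apos_numeric = html.unescape("&#39;")

# html.escape() itself sidesteps the name and emits a hex NCR.
escaped = html.escape("it's", quote=True)

print(apos_named, apos_numeric, escaped)  # ' ' it&#x27;s
```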

(9) You say "Here we present a quick summary of how to declare 
character encodings in the following formats:" And then you first of 
all list "HTTP". Is "HTTP" considered a format? I suggest you say 
"protocols and formats" instead of "formats". Either that, or you 
should, in the list, say "HTTP headers" instead of "HTTP" - as I
suppose an "HTTP header" can be described as a format.

(10) Another purpose of escapes is to circumvent browser bugs and
syntax limitations. E.g. Internet Explorer has (surprise) many bugs.
One of them is that the CSS selector "engine" of at least IE6 and IE7
does not accept, as first character in a class name, all the characters
that CSS permits. For instance, IE6 does not accept the '-'
(hyphen-minus) as first character. However, by preceding the '-' with
a '\' (inside a selector), it becomes selectable even in IE6.
CSS selector syntax also has built-in limitations, which escapes can
work around:

 *.7{} 

is not a valid selector, while 

 *.\7{} 

is a valid CSS selector.
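
A sketch of how such leading characters could be escaped by a script.
One caveat, which is an assumption of this sketch rather than
something tested above: by the CSS 2.1 escape rules, a backslash
followed by a hex digit is read as a code point escape, so "\7"
denotes U+0007, and a class that is literally named "7" would be
written "\37 " (0x37 being the digit 7). The function name is
invented for the example.

```python
def escape_class_start(class_name):
    """Escape a leading ASCII digit or hyphen of a class name."""
    if not class_name:
        return class_name
    first, rest = class_name[0], class_name[1:]
    if "0" <= first <= "9":
        # Code point escape; the trailing space terminates the escape
        # so that following hex-like letters are not absorbed.
        return "\\{:x} ".format(ord(first)) + rest
    if first == "-":
        # A hyphen is not a hex digit, so a plain backslash is enough.
        return "\\-" + rest
    return class_name


print("." + escape_class_start("7") + "{}")     # .\37 {}
print("." + escape_class_start("-nav") + "{}")  # .\-nav{}
```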

(11) You say "[...] you may feel you need to additionally use the 
encoding attribute of the XML declaration. On the other hand, you 
should be aware that this could cause rendering issues [....] quirks 
mode."

Instead of "that this could", please say "that the XML declaration
could". Otherwise, a sloppy/unaware reader could think that it is the
encoding attribute rather than the declaration which causes the quirks.
(My point is that whether you use the encoding attribute or not [can it
be skipped?] is not what brings you into quirks mode - it is the
declaration itself which - due to the way IE's doctype switch works -
is causing the - ah - quirk.)

Also, isn't there some way to work around the issue that the
declaration causes quirks mode? Like placing an HTML comment before the
declaration or something? (It is a very long time since I looked into
that thing.) I understand the wish to promote UTF-8, but if the
declaration does any good, then a way to use XML declarations without
bringing anyone into quirks mode would be a useful tip. (And more
focused on the topic of the article: encodings - rather than talking
about quirks mode that much ... see below.)

(12) Finally, things I do not especially want to see in such a
document: I'm often surprised when I see how many things appear
under the i18n heading at www.w3.org ...  And in this document: quirks
mode ???  Isn't it a stretch to talk about quirks mode in a
document about character encoding? I think the issues of quirks mode
should be explained somewhere, but not necessarily in this document, as
I think quirks mode makes no difference to the interpretation of
encodings, escapes etc. The only thing is the XML
declaration. Quirks mode appears to me as a digression from the main
topic!

(13) It would be far more relevant to bring in URL escaping than to
talk about Quirks Mode! URL escaping is also quite a confusing thing
for authors ... It is also an issue where HTML4 is not in tune with
reality: IRIs.
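
To illustrate why this confuses authors: the host part and the path
part of an IRI are escaped in two different ways. A minimal sketch,
assuming Python's built-in "idna" codec (IDNA 2003) and
urllib.parse.quote() with its default UTF-8 encoding; the path
"/blåbær/" is a hypothetical example.

```python
from urllib.parse import quote

# The host name is converted to Punycode ...
host = "målform.no".encode("idna").decode("ascii")

# ... while the path is percent-encoded, byte by byte, from UTF-8.
path = quote("/blåbær/")

print("http://" + host + path)
```

The first half of the result is the same xn-- form that shows up in
the mail address at the top of this message.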

OK. I expect that you will not agree with all I've said, and that you 
will not take notice of all this. But I hope you found some of it 
useful ...
-- 
leif halvard silli
Received on Wednesday, 10 February 2010 06:09:47 UTC