
RE: For review: Character encodings in HTML and CSS

From: Richard Ishida <ishida@w3.org>
Date: Thu, 18 Feb 2010 21:39:09 -0000
To: "'Leif Halvard Silli'" <xn--mlform-iua@xn--mlform-iua.no>
Cc: <www-international@w3.org>
Message-ID: <012401cab0e2$ceaa3880$6bfea980$@org>
See notes below...

============
Richard Ishida
Internationalization Lead
W3C (World Wide Web Consortium)

http://www.w3.org/International/
http://rishida.net/




> -----Original Message-----
> From: Leif Halvard Silli [mailto:xn--mlform-iua@målform.no]
> Sent: 10 February 2010 06:09
> To: Richard Ishida
> Cc: www-international@w3.org
> Subject: Re: For review: Character encodings in HTML and CSS
> 
> Richard Ishida, Tue, 9 Feb 2010 13:20:29 -0000:
> > Comments are being sought on this article prior to final release.
> > Please send any comments to this list (www-international@w3.org). We
> > expect to publish a final version in one to two weeks.
> >
> > See http://www.w3.org/International/tutorials/tutorial-char-enc/temp
> 
> > The rearrangement was to downplay slightly the XHTML 1.0 issues,
> > given that that is now only relevant to IE6,
> 
> > The update adds information about HTML5.
> 
> Here are the additional things that I would have liked to know when
> reading such a document ...
> 
> (1) It should be mentioned that in SGML-based markup, such as HTML4,
> one may omit the ";" in NCRs. All of the big 6 desktop browsers (IE,
> Firefox, Opera, Webkit, Konqueror, Chrome [assuming it is like
> Webkit]) support this _inside attributes_. (I have a quite thorough
> test document here: <http://målform.no/ncr-test/> ) They also all
> support it in text, with one exception: for NCRs directly in text, IE
> requires the semicolon for hex NCRs but not for decimal NCRs. [IE got
> support for hex NCRs later on, didn't it? Must be a bug ... !] So one
> could give the usage advice that it is "better" and simpler to use
> the semicolon than to omit it, but still say that dropping it is
> permitted. (My view is that it should be permitted in HTML5 too.)
> Another part of the advice could be that omitting it is safer - and
> more justified - inside machine-readable attributes than inside
> human-readable text.

I have added a note to the text that says:
"Some browsers allow you to omit the semicolon at the end of a numeric character reference, but this is not recommended, since it may lead to interoperability problems. Using the semicolon also avoids the potential problem of the end of the escape becoming undetectable when the escape is embedded in text."

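To illustrate the second sentence of that note, here is a minimal Python sketch. It uses the standard library's html.unescape, which follows HTML5 parsing rules and so tolerates a missing semicolon; it is only a stand-in for a lenient parser, not a test of any particular browser.

```python
import html

# With the semicolon, the reference ends exactly where the author intended.
with_semicolon = html.unescape("&#229;1")       # U+00E5 followed by "1"

# Without it, the following digit is absorbed into the reference,
# so the parser reads code point 2291 instead of 229.
without_semicolon = html.unescape("&#2291")

# The same happens with hex NCRs when a hex digit follows the reference.
hex_without_semicolon = html.unescape("&#xe5a")  # read as U+0E5A, not U+00E5 + "a"

print(with_semicolon == "\u00e5" + "1")
print(without_semicolon == chr(2291))
print(hex_without_semicolon == chr(0xE5A))
```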

> 
> (2) The document appears thin when it comes to CSS escapes.
> 
>   * The explanation of what a CSS escape is, is now located under the
> heading "What are entities and NCRs?"
> <http://www.w3.org/International/tutorials/tutorial-char-enc/temp#what>.
> I think a separate header for CSS escapes would be better. Or,
> alternatively, that the existing heading should be changed to say "What
> are entities, NCRs and CSS escapes?".
>   * There should also be a CSS escape example, the same way that there
> already are yellow colored examples of NCR and entities.
>   * (One of the) CSS examples could e.g. show what it means in practice
> that the space character terminates the CSS escape, as this can be
> highly confusing for authors. This can best be shown by having a CSS
> selector which contains only escaped letters, or a selector consisting
> of 3 letters with the escaped one in the middle:
> 
> .mål{}
> 
> becomes (note the space)
> 
> .m\0000e5 l{}

Thank you for pointing this out.  I have added a new section entitled "CSS escapes". (It's something I've been meaning to do for a long time, but it somehow got overlooked.)
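
As a rough illustration of the escape form in Leif's example, the following Python sketch rewrites the non-ASCII characters of an identifier as six-digit hex escapes, each followed by the terminating space. The function name css_escape_non_ascii is hypothetical, and the sketch deliberately ignores ASCII characters that would also need escaping in a selector.

```python
def css_escape_non_ascii(identifier: str) -> str:
    """Replace each non-ASCII character with a 6-digit CSS hex escape.

    A single space is appended after every escape; CSS treats that space
    as the terminator of the escape, not as part of the identifier.
    (Hypothetical helper: ASCII characters are passed through untouched.)
    """
    parts = []
    for ch in identifier:
        if ord(ch) < 128:
            parts.append(ch)
        else:
            parts.append(f"\\{ord(ch):06x} ")
    return "".join(parts)


print("." + css_escape_non_ascii("mål") + "{}")  # .m\0000e5 l{}
```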


> 
> (3) Specification of the encoding of an external CSS file: The text
> currently says that
> 
>     ]]If your external CSS style sheet contains any non-ASCII text [
> snip ] you should use the @charset rule as the first thing on the page.
> (It should not be used for CSS embedded in a document.)"[[
> 
>     However, I think many authors are not aware that they may use HTTP
> to signal the charset of CSS files as well. Therefore I think you
> should mention this. (You already mentioned another alternative in that
> context, namely to use the BOM. The BOM has support issues, you say,
> but HTTP works very well, AFAIK.)

I added a paragraph to the section on using HTTP to make that clearer.
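
For readers who want to see how the two declarations relate, here is a small Python sketch of a lookup that prefers the charset parameter of the HTTP Content-Type header and falls back to a leading @charset rule. The function name css_charset is hypothetical, the sketch ignores the BOM and the other fallbacks CSS 2.1 defines, and it is not how any particular browser is implemented.

```python
import re
from email.message import Message


def css_charset(content_type_header, css_bytes):
    """Return the declared encoding of a style sheet, or None.

    Hypothetical helper: checks the HTTP header first, then an
    @charset rule at the very start of the file. BOM sniffing and
    other fallbacks are intentionally left out.
    """
    msg = Message()
    msg["Content-Type"] = content_type_header
    http_charset = msg.get_content_charset()  # lower-cased, or None
    if http_charset:
        return http_charset
    match = re.match(rb'@charset "([^"]+)";', css_bytes)
    if match:
        return match.group(1).decode("ascii", "replace").lower()
    return None


sheet = b'@charset "utf-8";\n.m{}'
print(css_charset("text/css; charset=ISO-8859-1", sheet))  # HTTP header wins
print(css_charset("text/css", sheet))                      # falls back to @charset
```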


> 
> (4) The logic of using escapes in @style and <style> and stylesheets:
>   * I believe many web authors think they /have/ to use escapes e.g. in
> CSS selectors. So I think that the document should say that they don't
> have to - they can often type them directly - especially if CSS and
> HTML are located in the same document ...

I added the following para to the start of the section:

"It is best to choose the right encoding so that you can just use characters in CSS declarations. This section addresses what should be a very rare circumstance where you may have decided to use escapes."

> 
> (5) I believe that many authors are not aware that they may use
> character escapes inside (many) HTML attributes. Hence I think a word
> should be said about the fact that this is possible. (You talk about
> the style attribute, but @style is - or may appear to be - a special
> case.)
> 
> (6) You say that it is better to use CSS escapes inside the @style
> attribute. And the reason you give is related to the possible need for
> moving the escapes to the <style> element, or perhaps even to an
> external (CSS) file. In the same spirit, you should mention that one
> reason for using NCRs and entities can be that one wants to be able to
> present the same file in different encodings - without actually
> re-encoding the file first. You could perhaps add this inside or near
> the paragraph about "Encoding gaps".

I think that is rather an advanced topic for this tutorial.
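
For anyone curious about Leif's point (6), here is a minimal Python sketch of the idea: once every non-ASCII character is written as an NCR, the bytes are pure ASCII and read the same under several encoding labels. It uses only the standard library; the sample word is taken from the thread, and the choice of encodings is an assumption for illustration.

```python
import html

text = "mål"

# Characters outside ASCII are written as decimal NCRs.
ascii_only = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_only)  # b'm&#229;l'

# The same bytes decode identically whichever of these labels is applied,
# and resolving the NCRs gives back the original text each time.
for encoding in ("utf-8", "iso-8859-1", "windows-1252"):
    assert html.unescape(ascii_only.decode(encoding)) == text
```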


> 
> (7) Length of escapes: Something should be said about whether there
> are length limits/requirements for NCRs and CSS escapes:
>     * CSS2.1 limits the length to (I believe) 6 alphanumeric characters
> after the '\' and before the space character. No browser accepts CSS
> escapes that are longer than the limit either.
>     * For HTML, there are no specified limits. But in practice:
> Opera, Lynx and Firefox appear to accept endless escapes (such as
> &#0000000000000229;), whereas Webkit has a limit that looks to be 8
> characters, including zeros, and regardless of hex or dec. IE
> seems to have the exact same limit as in CSS (6 characters for hex
> NCRs - which is like the length limit in CSS escapes - and 7
> characters for dec NCRs [to be able to write the hex values with dec
> numbers, I suppose]). See again my test case:
> <http://målform.no/ncr-test/> - which tests only the letter 'ü' in
> different NCR "encodings".
> 	Thus, the advice could perhaps be to follow the CSS rules about the
> length of the escape: not longer than 6 letters. (Making them longer
> can be useful for targeting particular browsers though ...)
> 
> (8) You say that &apos; is not defined in HTML. However, it is defined
> in the HTML5 language specification draft. Thus, the advice to not use
> it because it is not defined in HTML, appears as solely a specification
> compatibility advice. It would perhaps be more relevant to, eventually,
> point to lack of user agent support (IE = no support,  Webkit =
> support).

Added. 


> 
> (9) You say "Here we present a quick summary of how to declare
> character encodings in the following formats:" And then you first of
> all list "HTTP". Is "HTTP" considered a format? I suggest you say
> "protocols and formats" instead of "formats". Either that, or you
> should, in the list, say "HTTP headers" instead of "HTTP" - as I
> suppose a "HTTP header" can be described as a format.

Done.

 
> (10) Another purpose of escapes is to circumvent browser bugs and
> syntax limitations. E.g. Internet Explorer has (surprise) many bugs.
> One of them is that the CSS selector "engine" of at least IE6 and IE7
> does not accept, as first character in a class name, all the characters
> that CSS permits. For instance, IE6 does not accept the '-'
> (hyphen-minus) as first letter. However, if the '-' is preceded by a
> '\' inside a selector, it becomes selectable even in IE6.
> CSS selector syntax also has built-in limitations, which can be escaped:
> 
> 	*.7{}
> 
> is not a valid selector, while
> 
> 	*.\7{}
> 
> is a valid CSS selector

I think this is also for a separate article for more advanced readers.


> 
> (11) You say "[...] you may feel you need to additionally use the
> encoding attribute of the XML declaration. On the other hand, you
> should be aware that this could cause rendering issues [....] quirks
> mode."
> 
> Instead of "that this could", please say "that the XML declaration
> could". Otherwise, a sloppy/unaware reader could think that it is the
> encoding attribute rather than the declaration which causes the quirks.
> (My point is that whether you use the encoding attribute or not [can it
> be skipped?] is not what brings you into quirks mode - it is the
> declaration itself which - due to the way IE's doctype switch works -
> is causing the - ah - quirk.)

Done.

> 
> Also, isn't there some way to work around the issue that the
> declaration causes quirks mode? Like placing a HTML comment before the
> declaration or something? (Very long time since I looked into that
> thing.) I understand the wish to promote UTF-8, but if the declaration
> does any good, then a way to use XML declarations without bringing
> anyone into quirks mode, would be a useful tip. (And more focused on
> the topic of the article: encodings - rather than talking about quirks
> mode that much ... see below.)

Not that I'm aware of.


> 
> (12) Finally, things I do not especially want to see in such a
> document: I'm often surprised when I see how many things that appear
> under the i18n heading at www.w3.org ...  And in this document: quirks
> mode ???  Isn't that to stretch it, to talk about quirks mode in a
> document about character encoding? I think the issues of quirks mode
> should be explained somewhere, but not necessarily in this document, as
> I think there are no issues w.r.t. interpretation of encodings and
> escapes etc in regard to quirks mode. The only thing is the XML
> declaration. Quirks mode appears to me as a deviation from the main
> topic!

An understanding of the concepts is important for understanding the points about the XML declaration, and I think it is ok to briefly address the topic here in that vein.  On the other hand, I have moved that section under Essential definitions, since I was unhappy with its current location.


> 
> (13) It would be far more relevant to bring in URL escaping than to
> talk about Quirks Mode! URL escaping is also quite a confusing thing
> for authors ... It is also an issue where HTML4 is not in tune with
> reality: IRIs.

I agree that it would probably be useful to say something about URL escaping.  I will try to add something, but not today.
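
In the meantime, here is a minimal Python sketch of what URL escaping does to non-ASCII characters. The host name is the one from Leif's test page; the path "/mål/" is a made-up example. The path is percent-encoded as UTF-8 bytes, while the host goes through IDNA (Python's built-in codec, which implements the 2003 version of that standard).

```python
from urllib.parse import quote, unquote

path = "/mål/"  # hypothetical path, for illustration only

# Non-ASCII characters in the path become percent-encoded UTF-8 bytes;
# "/" is left alone by default.
escaped_path = quote(path)
print(escaped_path)  # /m%C3%A5l/

# The host name is converted separately, using IDNA rather than %-escapes.
ascii_host = "målform.no".encode("idna").decode("ascii")
print(ascii_host)  # xn--mlform-iua.no

# Unescaping the path restores the original characters.
print(unquote(escaped_path) == path)
```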


> 
> OK. I expect that you will not agree with all I've said, and that you
> will not take notice of all this. But I hope you found some of it
> useful ...
> --
> leif halvard silli

Thank you for your comments.

RI
Received on Thursday, 18 February 2010 21:39:41 GMT
