RE: Changes to Essential definitions related to character encodings and Serving XHTML 1.0

> From: Gunnar Bittersmann
> Sent: 26 August 2010 16:05
> To: Richard Ishida
> Cc:
> Subject: Re: Changes to Essential definitions related to character encodings
> and Serving XHTML 1.0
> Richard Ishida scripsit (2010-08-20 10:18+02:00):
> > a new version of the document 'Serving XHTML 1.0'
> Markup: It’s <b class="newterm"> for “HTTP header”, but <span
> class="newterm"> for the other terms. Make it 'b' for all, HTML5 style.
> (The 'dfn' element type might also be appropriate, though.)

Thanks. I had started making such changes this morning, but hadn't got to this document yet.  All changed to dfn.

> In the previous version, “sends information” was linked to
> but is not
> any more. Was that intended?


> Typography: “these MIME types - ie.” Use en dash: these MIME types – ie.

Likewise. Done.

> “They recommend, amongst other things, that you leave a space before the
> '/>' at the end of an empty tag (such as img, hr or br), that you use
> HTML's lang attribute as well as XML's xml:lang attribute, that you
> always use both id and name attributes for fragment identifiers, etc.”:
> Yes, that’s what Appendix C has been saying for years. But as I’ve
> mentioned before, neither the first nor the last hint is still relevant
> for today’s browsers.

I modified the text by adding "These compatibility guidelines are particularly important for legacy versions of browsers." I think it's OK like this: following Appendix C doesn't cause harm.

> “This means that different rules are applied to the display of the file”:
> Hm, it’s definitely not the scope of this article to inform the reader
> about the distinction between files and resources. But shouldn’t the
> article use the right terminology?

This refers back to "Current mainstream browsers may display an HTML file".

> You’ve added a new paragraph: “In Internet Explorer 6 nothing must
> precede the DOCTYPE declaration in a file. If any character appears
> before it, the document will be served in quirks mode.”
> Just 3 paragraphs down: “ With Internet Explorer 6, however, if anything
> appears before the DOCTYPE declaration the page is rendered in quirks
> mode.”
> Hm, duplicate content. And really anything? BOM?

Good catch. Removed the first instance, and changed the second to "With Internet Explorer 6, however, if anything other than a byte-order mark appears before the DOCTYPE declaration the page is rendered in quirks mode."
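For anyone who wants to check their own files against this rule, here is a minimal Python sketch. It assumes the file is UTF-8 and that only the UTF-8 byte-order mark is tolerated; the function name doctype_comes_first is just something I made up for illustration, and this is not a description of how IE6 itself sniffs the document.

```python
import codecs


def doctype_comes_first(data: bytes) -> bool:
    """Return True if nothing but an optional UTF-8 BOM precedes the DOCTYPE.

    Assumption: the input is the raw bytes of a UTF-8 encoded page.
    """
    # Skip a leading UTF-8 byte-order mark, if there is one.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # The DOCTYPE keyword is matched case-insensitively.
    return data[:9].upper() == b"<!DOCTYPE"
```

Note that a page starting with an XML declaration, a comment, or even a single space before the DOCTYPE would fail this check.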

> “In browsers such as Internet Explorer 7, Firefox, Safari, Opera, and
> others”:
> Should Chrome be explicitly mentioned?

Ok. Done.

> “Since Internet Explorer 6 users may still count for a significant
> proportion of your intended audience”,
> “on Internet Explorer 6 (and therefore for a potentially significant
> proportion of your audience).”:
> Is there really any web site around these days whose intended audience
> has a significant proportion of IE 6 users? (If there is, I pity its web
> developer.)

There are corporate environments that are still using IE6, but one has to hope that they will eventually see the light...

> “If you want to ensure that your pages are rendered in the same way on
> all standards-compliant browsers”:
> This sounds as if IE 6 were a standards-compliant browser. Ehm – nah.

Changed to "If you want to ensure that your pages are rendered in the same way as on all other standards-compliant browsers,"

> > and some substantial reductions to the text in the 'MIME types' section of
> 'Essential definitions related to character encodings'
> Typography: “these MIME types - ie.” Use en dash: these MIME types – ie.

I have now completely removed the sections in that are dealt with by the Serving XHTML document.  I also renamed the latter document to Serving HTML & XHTML to make it a more general introduction to MIME types and to standards vs. quirks mode.

> Finally, some remarks regarding
> “a script that uses accents or diacritics.” Make it: a script that uses
> accents or other diacritics.


> “There are four Normalization Forms specified by the Unicode Standard:
> NFC, NFD, NFKC and NFKD. The 'C' stands for (pre-)composed, and the 'D'
> for decomposed.”:
> This raises the question of what the 'K' stands for – and leaves it unanswered.

Yes. I hesitated to bring attention to the compatibility stuff, but you're right, and I added a sentence.
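For illustration only (this is not text from the article), a small Python sketch using the standard unicodedata module shows what the 'K' forms add. The helper name all_forms is hypothetical. The canonical forms (NFC, NFD) leave a compatibility character such as the 'fi' ligature alone, while the compatibility forms (NFKC, NFKD) replace it with plain 'f' + 'i'.

```python
import unicodedata


def all_forms(text: str) -> dict:
    """Return the text in each of the four Unicode normalization forms."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFC", "NFD", "NFKC", "NFKD")}


ligature = "\ufb01"   # LATIN SMALL LIGATURE FI, a compatibility character
a_acute = "\u00e1"    # LATIN SMALL LETTER A WITH ACUTE, precomposed

ligature_forms = all_forms(ligature)
a_acute_forms = all_forms(a_acute)
```

Here ligature_forms["NFC"] is still the single ligature character, ligature_forms["NFKC"] is the two letters "fi", and a_acute_forms["NFD"] is 'a' followed by U+0301 COMBINING ACUTE ACCENT.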

> “If the word 'világ' is used in precomposed form in the HTML (eg. <span
> class="világ">), but in decomposed form in the CSS (eg. .világ {
> font-style: italic; })”:
> If there is any difference between the two 'á', it’s not visible. (I
> haven’t tried a hex editor.) Maybe use a CSS escape: .vila\301 g

Yes, there is a difference, but I didn't want to make it obvious by using a visible escape, since I didn't want to encourage the use of escapes.  I don't think it's really that important that people know that there's actually a difference in the examples used - part of the point is that visually you can't tell anyway, much of the time.
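If you do want to see the difference without a hex editor, a rough Python sketch like the following makes it visible. The variable and function names are my own, chosen for illustration. The two strings look identical on screen but do not compare equal until both are normalized to the same form.

```python
import unicodedata

precomposed = "vil\u00e1g"   # 'á' as one code point, U+00E1
decomposed = "vila\u0301g"   # 'a' followed by U+0301 COMBINING ACUTE ACCENT


def same_name(a: str, b: str) -> bool:
    """Compare two names after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_match = precomposed == decomposed               # code-point comparison
normalized_match = same_name(precomposed, decomposed)
```

The raw comparison fails, which is why a class selector in one form will not match a class name in the other; after normalizing both sides to NFC the names match.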

> “The best way to ensure this, especially if the HTML and the CSS files
> are authored by different people, is to use one particular Unicode
> normalization form for all authored content. As we said above, the W3C
> recommends NFC.
> This is likely to be a particular issue if the markup and the CSS are
> being authored or maintained by different people.”
> Duplicate content, remove the latter paragraph.

Yikes, how did I miss that?  Now hopefully better.

> Apart from the technical POV of this article, I just happened to run into
> this trouble: Optima is a good-looking font on Mac, but does not
> have a glyph for 'ř' as in 'Dvořák'. The browser takes a glyph from the
> next font in the font-family declaration that has a glyph for 'ř'.
> Now there are three options:
> (1) Ignore that there’s a patchwork. Not a good solution.
> (2) Use NFC characters and a font that provides the needed glyphs. From
> a technical POV the best solution, but to refrain from using a
> good-looking font just because of some occasional characters?
> (3) Use NFD characters such as 'r&#x30C;'. Technically questionable, but
> best typography.
> Which one to take?

First, I should say that the article itself is concerned with HTML and CSS code - ie. id and class names.  In this case, I don't think the look of the text is as important as the function, and I would argue that NFC is always the way to go.

For content, things are more flexible. My view is that it's probably better to stay with NFC if you can, but I wouldn't make that a hard-and-fast rule. I think that saving the file in NFC but using escapes for the problem characters might be the best approach, although I can't see this as a terribly good solution for large amounts of text. I think it should always be viewed as a hack until the font is improved or a better font is found - fonts are always subject to limitations, and I think it's better to put pressure on the font vendor/developer to improve the font than to quietly sanction it by working around its limitations.
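As a sketch of that workaround (and only a sketch), the following Python keeps the text in NFC and writes out just the chosen problem characters as a base letter plus a numeric character reference for the combining mark, as in your 'r&#x30C;' example. The function name decompose_with_escapes and its parameters are hypothetical.

```python
import unicodedata


def decompose_with_escapes(text: str, chars: set) -> str:
    """Keep text in NFC, but emit the given characters in decomposed form,
    with each combining mark written as an HTML numeric character reference."""
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in chars:
            for part in unicodedata.normalize("NFD", ch):
                if unicodedata.combining(part):
                    # Combining mark: write it as a visible escape.
                    out.append("&#x%X;" % ord(part))
                else:
                    out.append(part)
        else:
            out.append(ch)
    return "".join(out)


escaped = decompose_with_escapes("Dvo\u0159\u00e1k", {"\u0159"})
```

With 'ř' (U+0159) as the only listed character, the result is "Dvor&#x30C;ák": the 'á' stays precomposed, and the escape makes the exception easy to find and remove once the font is fixed.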

Thanks again for your helpful comments, Gunnar.


> Regards,
> Gunnar

Received on Thursday, 2 September 2010 10:22:18 UTC