
RE: charset

From: Harley Rosnow <Harley.Rosnow@microsoft.com>
Date: Thu, 10 Apr 2008 15:47:07 -0700
To: "rick.denhaan@gmail.com" <rick.denhaan@gmail.com>, "public-html-comments@w3.org" <public-html-comments@w3.org>
Message-ID: <7AD436E4270DD54A94238001769C2227011DAFFE3F23@DF-GRTDANE-MSG.exchange.corp.microsoft.com>

Hi Rick,

Let me start by saying that I'm not Microsoft's representative on the WG, so please don't take my statements as the WG's official response or as Microsoft's official position on these issues.  These are my own opinions, based on my experience.  With that out of the way, let's talk.

Clearly the best solution is to use UTF-8 and indicate that fact with a BOM at the start of the document.  UTF-8 can represent all characters, and the BOM obviates the need for scanning since it sits at the very start of the file.  An HTTP header that declares the encoding has the same efficiency benefit, because it too is available before the document content.
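To make the detection concrete: only the first three bytes of the stream matter, so no scanning of the body is needed.  A minimal Python sketch (the helper name and the fallback encoding are my own illustration, not part of any spec):

```python
# Detect a UTF-8 byte-order mark at the very start of a document.
# Only the first three bytes need to be examined, so the body never
# has to be scanned for encoding information.
UTF8_BOM = b"\xef\xbb\xbf"

def decode_with_bom(raw: bytes, fallback: str = "windows-1252") -> str:
    """Decode as UTF-8 when a BOM is present, else use a fallback encoding."""
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):].decode("utf-8")
    return raw.decode(fallback)

print(decode_with_bom(b"\xef\xbb\xbfcaf\xc3\xa9"))  # prints "café"
```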

That said, if you do need to include text from languages which cannot be represented in the encoding of the document, there's an existing solution already in HTML5: character references: http://www.w3.org/TR/html5/#character.  The cool thing about these is that they open up the entire Unicode repertoire to documents in any encoding.  The uncool thing is that they use a larger number of bytes to represent each character.  If the scenario is true multilingual text and a compact representation is a requirement, then a Unicode encoding is the way to go, and UTF-8 is the best choice given the prevalence of ASCII characters in HTML markup, CSS and script.
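As an illustration of that size trade-off, Python's codecs can emit exactly these references via the xmlcharrefreplace error handler; the helper below is my own sketch, not anything from the HTML5 draft:

```python
def to_char_refs(text: str, encoding: str) -> str:
    """Replace characters outside the target encoding with numeric
    character references; representable characters pass through as-is."""
    return text.encode(encoding, errors="xmlcharrefreplace").decode(encoding)

# Six Cyrillic characters become six multi-byte references in an
# ASCII-only document, versus twelve bytes in UTF-8:
print(to_char_refs("Привет", "ascii"))
```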

One problem with the proposal below is that we can't reliably scan the entire document at the start.  The document is downloaded off the network in buffers and we need to be able to push those buffers as far through the pipeline as possible as they arrive.

Your concern about the allocation, copying, freeing and handling of buffers is well considered.  That's probably going to determine a lot of the performance and complexity of such a proposal.

But what really makes this proposal undesirable to me is that it adds unnecessary complexity to the standard.  We already have UTF-8.  The proposal is also vulnerable to cross-site scripting (XSS) attacks, because parts of the document could have their encoding switched mid-stream through the injection of markup.  As any of the authors of server-side XSS filters can attest, we need to keep the logic to decode our files as simple and deterministic as possible.

Thanks,

Harley Rosnow
Internet Explorer Development
Microsoft Corporation

-----Original Message-----
From: Rick den Haan [mailto:rick.denhaan@gmail.com]
Sent: Thursday, April 10, 2008 1:38 AM
To: Harley Rosnow; public-html-comments@w3.org
Subject: RE: charset

NOTE: I'm not a WG member, so this is most definitely not an official
response.

Harley Rosnow wrote:
> Servers that compose files together need to make their encoding
> consistent in the rendered composite file.  The same holds true
> for composition which occurs on the client.

True, but what if you need to have, e.g., Russian, Hebrew and Chinese text
in the same document?

Would it perhaps be an option to modify the META tag to allow multiple
values, where the first value is used as default?

For example:

<html>
<head>
    <meta http-equiv="Content-Type" content="text/html;
charset=KOI8-R,UTF-8,GB2312">
</head>
<body>
    <section id="russian_content" lang="ru">
        <!-- Since KOI8-R was first in the meta, no charset is required -->
    </section>
    <section id="hebrew_content" lang="he" dir="rtl" charset="UTF-8">
        <!-- Some Hebrew text, rendered and decoded using the UTF-8 charset
-->
    </section>
    <section id="chinese_content" lang="zh" charset="GB2312">
        <!-- Some Chinese text, rendered and decoded using the GB2312
charset -->
    </section>
</body>
</html>

In this situation, browsers can:

(1) Parse the META tag
(2) If only one charset is given, decode the entire document using that
charset
(3) If multiple charsets were given, preload the given charsets
(3a) Scan the document for elements with a charset-attribute
(3b) When found, decode the contents of that element using the given
charset, and drop it into a buffer
(3c) Decode the rest of the document using the default (first) charset given
(3d) Insert the decoded contents from the buffers into their correct
positions in the document
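Purely to illustrate steps (3a)-(3d) above (and not as an endorsement), here is a naive Python sketch that treats each charset-marked element as a separately decoded region.  The charset attribute is the hypothetical one from the example, and the code assumes sections don't nest and the markup itself is ASCII-safe:

```python
import re

def decode_mixed(raw: bytes, default_charset: str) -> str:
    """Decode a byte stream whose <section> elements may carry a
    hypothetical charset attribute; everything else uses the default
    (first-listed) charset.  Deliberately naive: no nesting, and the
    tags themselves are assumed to be ASCII-compatible."""
    pattern = re.compile(
        rb'(<section[^>]*charset="([^"]+)"[^>]*>)(.*?)(</section>)',
        re.DOTALL,
    )
    out = []
    pos = 0
    for m in pattern.finditer(raw):
        # (3c) Text before this section uses the default charset.
        out.append(raw[pos:m.start()].decode(default_charset))
        # (3a)/(3b) The section's content uses its declared charset.
        out.append(m.group(1).decode("ascii"))
        out.append(m.group(3).decode(m.group(2).decode("ascii")))
        out.append(m.group(4).decode("ascii"))
        pos = m.end()
    # (3d) Remaining trailing text, again in the default charset.
    out.append(raw[pos:].decode(default_charset))
    return "".join(out)
```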

I'm not a software developer, so I may be thinking too simply here, but I
wouldn't consider this too difficult to implement.  The buffers might be a
memory hog on low-end systems.  And of course, there's the matter of what to
do if someone uses this in combination with Ajaxy goodness and loads, oh I
don't know, Korean content for example, and adds that to the document while
that charset isn't loaded.

Cheers,
Rick.
Received on Thursday, 10 April 2008 22:47:59 GMT
