
RE: charset

From: Harley Rosnow <Harley.Rosnow@microsoft.com>
Date: Wed, 9 Apr 2008 10:26:30 -0700
To: hrhrhr hahaha <kevking@hotmail.com>, "public-html-comments@w3.org" <public-html-comments@w3.org>
Message-ID: <7AD436E4270DD54A94238001769C2227011DAFFE3BBB@DF-GRTDANE-MSG.exchange.corp.microsoft.com>

Hello,

Such a multi-charset feature would be very difficult for user agents to support.  I work on Internet Explorer's "charset" implementation.  Simply put, in the implementation of the "charset" META tag, data is (1) placed in a buffer, (2) scanned for a charset META tag in the head, (3) decoded based on that charset, (4) tokenized by a lexer and (5) the tokens are parsed into some internal representation.  While user agent implementations vary and I've greatly simplified things, it's still useful to think of the processing in these terms.
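
The five phases above can be sketched roughly as follows. This is only an illustrative Python sketch, not Internet Explorer's actual code: the names `prescan_charset`, `TagCollector` and `process`, the 1024-byte scan window and the windows-1252 fallback are all assumptions made for the example, and the standard library's `html.parser` stands in for phases (4) and (5).

```python
import re
from html.parser import HTMLParser

# Phase 2 helper (hypothetical): find "charset=..." inside a <meta> tag.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)

def prescan_charset(buffer: bytes, default: str = "windows-1252") -> str:
    """Scan the first bytes of the buffer for a charset META tag."""
    match = META_CHARSET.search(buffer[:1024])  # assumed scan window
    return match.group(1).decode("ascii") if match else default

class TagCollector(HTMLParser):
    """Stand-in for phases 4 and 5: tokenize and record start tags."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def process(buffer: bytes):
    # Phase 1: the data is already in `buffer`.
    charset = prescan_charset(buffer)                 # phase 2
    text = buffer.decode(charset, errors="replace")   # phase 3
    parser = TagCollector()
    parser.feed(text)                                 # phases 4 and 5
    parser.close()
    return charset, parser.tags

page = (
    b'<html><head>'
    b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
    b'</head><body><p>caf\xc3\xa9</p></body></html>'
)
charset, tags = process(page)
```

Note that the whole buffer is decoded once, in phase (3), before any element or attribute is seen by the parser. A charset attribute on an element in the body would only become visible after that decision has already been made.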

The difficulty with this suggestion is that the user agent could not identify and handle charset attributes until either (I) the parsing phase (5), or (II) an added early scan that violates the phasing above.  If the attribute is discovered during parsing, the affected portion of the file would have to be reset to phase (3), decoded again, retokenized and reparsed.  Alternatively, the user agent could attempt to find such attributes at a much earlier phase.  However, many encodings have problematic characteristics: characters from the ASCII range (used for elements, attributes, etc.) may not be encoded as ASCII bytes in the stream, or shift states may change the interpretation of bytes.  As a result, these early scans are very difficult and expensive (in terms of performance) to implement.  With either approach, this kind of reinterpretation of data lends itself to security attacks (I could hide script in a different encoding from the rest of the file), makes XSS filtering much more difficult, hurts performance and ultimately leads to inconsistent implementations across user agents.
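
The two problematic characteristics mentioned above can be illustrated with a small Python sketch. The `span` markup with a `charset` attribute is the hypothetical feature under discussion, not an existing one, and UTF-16 and ISO-2022-JP are chosen here only as examples of such encodings.

```python
# Hypothetical markup using the proposed (non-existent) charset attribute.
MARKUP = '<span charset="shift_jis">'
NEEDLE = b"charset="

def ascii_scan_finds_attribute(encoded: bytes) -> bool:
    """A naive early scan: look for the attribute name as raw ASCII bytes."""
    return NEEDLE in encoded

# In UTF-8 the ASCII range is encoded as ASCII bytes, so the scan works.
found_in_utf8 = ascii_scan_finds_attribute(MARKUP.encode("utf-8"))

# In UTF-16 every ASCII character is padded with a zero byte, so the
# same scan silently misses the attribute.
found_in_utf16 = ascii_scan_finds_attribute(MARKUP.encode("utf-16-le"))

# ISO-2022-JP is a 7-bit encoding with shift states: after an escape
# sequence, bytes that look like ASCII actually encode Japanese text.
japanese = "\u65e5\u672c"  # two kanji characters
stateful_bytes = japanese.encode("iso2022_jp")
all_seven_bit = all(byte < 0x80 for byte in stateful_bytes)
misread_as_ascii = stateful_bytes.decode("ascii")
```

Here `found_in_utf8` is true while `found_in_utf16` is false, and every byte of `stateful_bytes` falls in the ASCII range even though it does not represent ASCII text. A scanner that does not track the shift state would misread those bytes.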

I'd recommend against such a multi-charset feature.  Servers that compose files together need to make the encoding consistent in the rendered composite file.  The same holds true for composition that occurs on the client.  Thanks,

Harley Rosnow
Internet Explorer Development
Microsoft Corporation

-----Original Message-----
From: public-html-comments-request@w3.org [mailto:public-html-comments-request@w3.org] On Behalf Of hrhrhr hahaha
Sent: Wednesday, April 09, 2008 7:42 AM
To: public-html-comments@w3.org
Subject: charset


Hi,

If a page has its charset set in the head section via a meta tag, and another charset is used elsewhere within the same page, what would you use, aside from wrapping it in an a tag/element? Maybe, alongside lang (and xml:lang), charset 'could' be added to span? Even when using utf-8, there could be a charset NOT in or recognised by utf that could be 'added' to the page via an inline tag/element?
Received on Thursday, 10 April 2008 06:42:40 GMT
