
RE: charset

From: Harley Rosnow <Harley.Rosnow@microsoft.com>
Date: Thu, 10 Apr 2008 15:52:42 -0700
To: TAN Kuan Hui <tankuanhui@gmail.com>, hrhrhr hahaha <kevking@hotmail.com>, "public-html-comments@w3.org" <public-html-comments@w3.org>
Message-ID: <7AD436E4270DD54A94238001769C2227011DAFFE3F29@DF-GRTDANE-MSG.exchange.corp.microsoft.com>

I completely agree that it's best if the encoding is encapsulated in the document rather than placed in all the consuming documents.  That goal can also be achieved through the use of UTF-8 and the placement of a BOM at the start of the file.  Thanks,

Harley Rosnow
Internet Explorer Development
Microsoft Corporation
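
A minimal sketch of the BOM approach mentioned above, assuming a consumer that checks for the UTF-8 byte order mark before falling back to some other encoding. The function name `decode_with_bom` and the fallback default are hypothetical and only illustrate the idea; they do not describe how any particular user agent works.

```python
import codecs


def decode_with_bom(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 when the data starts with a UTF-8 BOM, else use fallback.

    Hypothetical helper: the name and the fallback default are assumptions.
    """
    if data.startswith(codecs.BOM_UTF8):
        # The BOM is the three bytes EF BB BF; strip it and decode the rest.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: the resource does not declare itself, so the caller must guess.
    return data.decode(fallback)


# A script file that carries its own encoding declaration via the BOM.
script_bytes = codecs.BOM_UTF8 + "var greeting = '你好';".encode("utf-8")
print(decode_with_bom(script_bytes, fallback="gb2312"))
```

With the BOM present, the fallback passed by the consuming document is never consulted, which is the sense in which the encoding travels with the file.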

-----Original Message-----
From: TAN Kuan Hui [mailto:tankuanhui@gmail.com]
Sent: Thursday, April 10, 2008 1:03 AM
To: Harley Rosnow; hrhrhr hahaha; public-html-comments@w3.org
Subject: Re: charset

Also on the subject of charset; <script> can be further tagged with a
charset declaration when it has a linked resource. Otherwise it is assumed
to be in the same encoding as the document.

for example,
<meta http-equiv='Content-Type' content='text/html; charset=UTF-8'>
<script type='text/javascript' src='http://someabc.com/scriptA.js'
charset='gb2312'/>
<script type='text/javascript' src='http://somexyz.com/scriptB.js'
charset='big5'/>

if scriptA.js inserts some html code into the document, for example,
    <script type='text/javascript' src='http://someabc.com/scriptAA.js'/>
and fails to declare its encoding, the mashup potentially fails.

The problem is that the encoding should be declared in the script itself and
not on the element (i.e. the standard should advocate a consistent approach
such as those for style sheets and XML, where the declaration resides with
the source and not with the consuming document). This shifts the problem
entirely to the developer who wrote the script, and the encoding
problem becomes "TRANSPARENT" to the consumer.

I am sure this must have been deliberated in detail in the past, but with
the meteoric rise of ajax, it's perhaps worth a revisit. However, I agree
with Harley that a multi-charset feature might be unnecessarily complicated.

Thanks.


----- Original Message -----
From: "Harley Rosnow" <Harley.Rosnow@microsoft.com>
To: "hrhrhr hahaha" <kevking@hotmail.com>; <public-html-comments@w3.org>
Sent: Thursday, April 10, 2008 1:26 AM
Subject: RE: charset



Hello,

Such a multi-charset feature would be very difficult for user agents to
support.  I work on Internet Explorer's "charset" implementation.  Simply
put, in the implementation of the "charset" META tag, data is (1) placed in
a buffer, (2) scanned for a charset META tag in the head, (3) decoded based
on that charset, (4) tokenized by a lexer and (5) the tokens are parsed into
some internal representation.  While user agent implementations vary and
I've greatly simplified things, it's still useful to think of the processing
in these terms.
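
The phases above can be sketched roughly as follows. This is a simplified illustration and not Internet Explorer's implementation: the regular expression, the 1024-byte scan window, the default encoding and the function names are all assumptions, and phase (5) is omitted.

```python
import re

# Hypothetical, simplified pattern; a real prescan handles many more cases.
META_CHARSET = re.compile(
    rb"<meta[^>]*charset\s*=\s*[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def sniff_charset(buffer: bytes, default: str = "cp1252") -> str:
    # Phase 2: scan the raw, still undecoded bytes for a charset META tag.
    # Only the start of the buffer is examined (window size is an assumption).
    match = META_CHARSET.search(buffer[:1024])
    return match.group(1).decode("ascii") if match else default


def decode_document(buffer: bytes) -> str:
    # Phase 3: decode the whole buffer with the single charset that was found.
    # An encoding label unknown to Python would raise LookupError here.
    return buffer.decode(sniff_charset(buffer), errors="replace")


def tokenize(text: str) -> list:
    # Phase 4: a toy lexer that splits the decoded text into tags and text runs.
    return re.findall(r"<[^>]+>|[^<]+", text)


# Phase 1: the data arrives in a buffer.
page = (
    b"<html><head><meta http-equiv='Content-Type' "
    b"content='text/html; charset=UTF-8'></head>"
    b"<body>\xe4\xbd\xa0\xe5\xa5\xbd</body></html>"
)
print(sniff_charset(page))
print(tokenize(decode_document(page)))
```

Note that the charset is fixed before any tokenizing happens, which is why a charset attribute discovered later, during parsing, does not fit this ordering.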

The difficulty of this suggestion is that the user agent wouldn't be able to
identify and handle charset attributes until [I] during the parsing phase
(5) or [II] by adding an early scan that violates the above phasing.  If
discovered during parsing, the appropriate portion of the file would have to
be reset back to phase (3), decoded again, retokenized and reparsed.  On the
other hand, the user agent could attempt to find such attributes at a much
earlier phase.  There are many encodings that have problematic
characteristics such as characters from the ASCII range (used for elements,
attributes, etc.) not being encoded as ASCII characters in the stream or
shift states that change the interpretation of bytes.  As a result, these
early scans are very difficult and expensive (in terms of performance) to
implement.  For either approach, this kind of reinterpretation of data
lends itself to security attacks (I could hide script in a different
encoding from the rest of the file), makes XSS filtering much more
difficult, leads to bad performance issues and ultimately inconsistent
implementation across the different user agents.
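
A small sketch of the encoding characteristics described above, using Python's built-in codecs. The sample strings and variable names are chosen only for illustration; UTF-16, ISO-2022-JP and UTF-7 are picked here as examples of the problem classes and are not named in the message itself.

```python
ascii_marker = b"<script"

# ASCII-range characters not encoded as ASCII bytes: in UTF-16 every
# character takes two bytes, so a byte-level scan for the marker finds nothing.
utf16_bytes = "<script>".encode("utf-16-le")
print(ascii_marker in utf16_bytes)  # False

# Shift states: ISO-2022-JP is a 7-bit encoding that uses escape sequences
# to switch character sets, so every byte looks like ASCII even though the
# bytes between the escapes are not ASCII text.
iso_bytes = "日本".encode("iso2022_jp")
print(all(b < 0x80 for b in iso_bytes))  # True
print(iso_bytes.decode("iso2022_jp") == "日本")  # True

# Hiding script in a different encoding: these bytes contain no "<" at all,
# yet decode to a script tag when interpreted as UTF-7.
hidden = b"+ADw-script+AD4-"
print(b"<" in hidden)  # False
print(hidden.decode("utf-7"))  # <script>
```

A filter that inspects the bytes under one encoding sees nothing suspicious in `hidden`, while a consumer that later reinterprets the same bytes under another encoding sees markup.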

I'd recommend against such a multi-charset feature.  Servers that compose
files together need to make their encoding consistent in the rendered composite
file.  The same holds true for composition which occurs on the client.
Thanks,

Harley Rosnow
Internet Explorer Development
Microsoft Corporation

-----Original Message-----
From: public-html-comments-request@w3.org
[mailto:public-html-comments-request@w3.org] On Behalf Of hrhrhr hahaha
Sent: Wednesday, April 09, 2008 7:42 AM
To: public-html-comments@w3.org
Subject: charset


Hi,

If a page has its charset set in the head section, via meta tag, and
elsewhere, within the same page is another charset used, aside from wrapping
it in an a tag/element, what would you use? Maybe, alongside lang (and
xml:lang) the charset 'could' be added to span? Even using utf-8, there
could be a charset used NOT in or recognised by utf, that could be 'added'
to the page via inline tag/element?
Received on Thursday, 10 April 2008 22:54:59 GMT