[whatwg] base64 entities from Martin Janecke on 2010-08-26 (public-whatwg-archive@w3.org from August 2010)

From: Martin Janecke <whatwg.org@kaor.in>
Date: Thu, 26 Aug 2010 10:38:44 +0200
Message-ID: <4C762814.5010601@kaor.in>

On 26.08.10 01:41, Adam Barth wrote:
> On Wed, Aug 25, 2010 at 1:55 PM, Ian Hickson <ian at hixie.ch> wrote:
>> On Wed, 25 Aug 2010, Adam Barth wrote:
>>> HTML should support Base64-encoded entities to make it easier for
>>> authors to include untrusted content in their documents without
>>> risking XSS.
>>
>> Seems like a fine idea. Get browsers to implement it and I'll spec it.
>
> I've posted a patch for WebKit:
>
> https://bugs.webkit.org/show_bug.cgi?id=44641
>
> Some subtleties:
>
> 1) Some base64 decoders tolerate newlines.  We don't want to decode
> entities with newlines.
> 2) Decoding base64 results in binary data.  We'll need to convert that
> data to characters in order to deal with it in the DOM.  We always
> use UTF-8 for that transformation, regardless of the document's
> encoding.
> 3) Null characters are replaced with U+FFFD.
> 4) The empty base64 entity &%; is consumed and is replaced with the
> empty string.
> 5) Invalid base64 is rejected and the entity is not decoded.
>
> Adam
>
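
To check my reading of the five points above, here is a rough sketch in
Python. It is not the WebKit patch; the function name is made up, and
the handling of padding and of bytes that are not valid UTF-8 is my own
assumption, since neither is spelled out above.

```python
import base64
import binascii
import re

# Assumption: the entity body uses the standard base64 alphabet with
# optional trailing "=" padding. Anything else (including newlines) fails.
_BASE64_BODY = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def decode_base64_entity(entity):
    """Hypothetical helper: return the decoded text for an entity of the
    form "&%...;", or None if the entity should be left undecoded."""
    if not (entity.startswith("&%") and entity.endswith(";")):
        return None
    body = entity[2:-1]

    # Point 4: the empty entity is consumed and yields the empty string.
    if body == "":
        return ""

    # Point 1: no newlines (or other stray characters) are tolerated.
    if not _BASE64_BODY.fullmatch(body):
        return None

    # Point 5: invalid base64 (e.g. bad length or padding) is rejected.
    try:
        raw = base64.b64decode(body, validate=True)
    except (binascii.Error, ValueError):
        return None

    # Point 2: always decode as UTF-8, whatever the document encoding is.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: bytes that are not valid UTF-8 count as invalid.
        return None

    # Point 3: null characters are replaced with U+FFFD.
    return text.replace("\x00", "\ufffd")
```

With this reading, &%4oCT; yields U+2013 (EN DASH), &%AA==; yields
U+FFFD, and an entity whose body contains a newline is left alone.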

Do we need to consider compatibility issues here? In HTML 4 the
following seems to have been valid markup (according to
http://validator.w3.org/check):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=US-ASCII">
<title>base64 entity test</title>
</head>
<body>
<p>Look at these fine ASCII characters: &%4oCT;</p>
</body>
</html>

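Under the proposal it would be interpreted differently: &%4oCT; would be
decoded (to U+2013 EN DASH, if I decode it correctly) instead of being
shown as literal text. Could this lead to old documents changing in
meaning? Do we also have to consider old documents that were not
completely valid (e.g. lacked a doctype declaration)?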
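
One way to get a feeling for how common this is would be to scan
existing documents for text that the new syntax would capture. A
minimal sketch, assuming an entity is "&%" followed by base64 alphabet
characters, optional "=" padding and ";" (the pattern and the function
name are my assumptions, not taken from the patch):

```python
import re
from typing import List

# Assumed shape of a base64 entity: "&%", base64 characters, optional
# "=" padding, then ";". The empty entity "&%;" matches as well.
CANDIDATE_ENTITY = re.compile(r"&%[A-Za-z0-9+/]*={0,2};")


def find_candidate_entities(html: str) -> List[str]:
    """Hypothetical helper: list substrings of an old document that
    would start being treated as base64 entities."""
    return CANDIDATE_ENTITY.findall(html)


if __name__ == "__main__":
    sample = "<p>Look at these fine ASCII characters: &%4oCT;</p>"
    print(find_candidate_entities(sample))
```

This only finds candidates; whether a match actually changes meaning
still depends on whether its body is valid base64.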

Received on Thursday, 26 August 2010 01:38:44 UTC