Re: CDATA, Script, and Style from Henri Sivonen on 2009-03-31 (public-html@w3.org from March 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Tue, 31 Mar 2009 12:30:56 +0300
To: Simon Pieters <simonp@opera.com>
Cc: "Jonas Sicking" <jonas@sicking.cc>, "Doug Schepers" <schepers@w3.org>, "HTML WG" <public-html@w3.org>, www-svg@w3.org
Message-Id: <6B9A519C-E8F5-42D0-91A3-17FD042EC5DF@iki.fi>
On Mar 25, 2009, at 16:24, Simon Pieters wrote:

> On Thu, 19 Mar 2009 18:52:25 +0100, Jonas Sicking <jonas@sicking.cc>  
> wrote:
>
>> My feelings on 1 vs. 2 is:
>>
>> Problems with 1:
>> Parsing <![CDATA[]]> inside a CDATA element "feels" weird.

I agree that it feels weird.

I think the biggest problem with this entire issue is that the  
difference between HTML <script> and <script> in XML is surprising and  
unintuitive, so we will have a surprise boundary somewhere no matter  
what. It seems on the general level we have the following options:

  1) Have the surprise boundary between text/html and XML. (The  
situation before SVG-in-text/html)

  2) Have the surprise boundary between HTML <script> in text/html and  
everything else. (The situation with SVG-in-text/html as drafted)

  3) Have graded surprises with two boundaries:
     a) Have a surprise boundary between HTML <script> and SVG-in-text/ 
html <script> and another between SVG-in-text/html <script> and XML.
     b) Have a surprise boundary between pre-HTML5 <script> and HTML5  
text/html <script>s and another between text/html and XML.

I'm worried about escaping surprises in general, having seen the RSS  
<title> epic fail.

>> Parsing for
>> CDATA has remained largely the same since the dawn of human kind
>> (well, the particular branch of human kind that supports SGML). But
>> the bigger problem with supporting <![CDATA[]]> inside <script> is
>> that it'd break existing HTML content like:
>> <script>
>> x = "<res><![CDATA[if a < b < c then they are sorted]]></res>";
>> var parser = new DOMParser();
>> var doc = parser.parseFromString(x, "text/xml");
>> xhr = new XMLHttpRequest();
>> xhr.open("POST", uri);
>> xhr.send(doc);
>> </script>

This argues for preferring any of the options over 3b.
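
To make the breakage concrete, here is a minimal sketch. It assumes a
parser that strips CDATA section markers from HTML <script> text before
the script engine sees it; stripCdataMarkers is a hypothetical stand-in
for that behavior, not something taken from any draft.

```javascript
// Hypothetical stand-in for a parser that strips CDATA section markers
// from the text of an HTML <script> element.
function stripCdataMarkers(scriptText) {
  return scriptText.split("<![CDATA[").join("").split("]]>").join("");
}

// The first statement of the script, as the author wrote it.
const authored =
  'x = "<res><![CDATA[if a < b < c then they are sorted]]></res>";';

// What the script engine would receive under the assumed stripping.
const received = stripCdataMarkers(authored);

console.log(received);
```

With the markers gone, the string handed to parseFromString contains
bare "<" characters in element content, so it would no longer parse as
well-formed XML.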

>> Problems with 2:
>> Just stripping a leading and trailing "<![CDATA[" / "]]>" would break
>> markup like:
>> <style>
>> <![CDATA[
>> rect { fill: yellow; }
>> ]]>
>> <![CDATA[
>> circle { fill: blue; }
>> ]]>
>> </style>
>>
>> which probably happens occasionally due to copy-n-pasting.

I don't like this, because it requires going back and modifying  
buffers that have already been built instead of just tweaking  
forward-only tokenizer state transitions, and it doesn't even work in  
the case where there are multiple CDATA sections as shown above. If we  
end up doing something other than what's currently in the draft, I'd  
much rather have what Simon proposes as #4.
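
A rough sketch of why stripping only the outer markers falls short,
assuming the whole text of the element has already been buffered.
stripOuterCdata is a hypothetical helper written for illustration.

```javascript
// Hypothetical helper: remove one leading "<![CDATA[" and one trailing
// "]]>" from already-buffered element text. Note that the trailing
// strip can only happen after the whole buffer has been built.
function stripOuterCdata(text) {
  const OPEN = "<![CDATA[";
  const CLOSE = "]]>";
  let s = text.trim();
  if (s.startsWith(OPEN)) {
    s = s.slice(OPEN.length);
  }
  if (s.endsWith(CLOSE)) {
    s = s.slice(0, s.length - CLOSE.length);
  }
  return s;
}

// The <style> content from the quoted example, with two CDATA sections.
const styleText = [
  "<![CDATA[",
  "rect { fill: yellow; }",
  "]]>",
  "<![CDATA[",
  "circle { fill: blue; }",
  "]]>",
].join("\n");

const strippedStyle = stripOuterCdata(styleText);
console.log(strippedStyle);
```

The inner "]]>" and "<![CDATA[" pair survives in the middle of the
style sheet text, which is the failure described above.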

> (3) Have a "dirty" flag that's initially false and is set to true  
> when you see non-whitespace other than the string "<![CDATA[", which  
> is stripped and sets the insertion mode to "in CDATA section in  
> CDATA element" which eats the next "]]>" and switches back to the  
> previous insertion mode and resets the dirty flag.
>
> However this still wouldn't handle stuff like <script><![CDATA[x =  
> "<res><script></script></res>"]]></script> (which the  
> <script><!--...--></script> syntax supports).
>
> Also, should we support CDATA sections in RCDATA elements? Should  
> they make entities not be expanded there?
>
> (4) Make <![CDATA and ]]> equivalent to <!-- and --> in (R)CDATA,  
> except that they are stripped.

I think doing this for SVG <script> but not HTML <script> would be my  
preferred way of implementing option 3a (according to my numbering :-)  
of graded surprise.

If the tree builder is in foreign content, "<![CDATA[" in the DATA and  
CDATA states would transition to the CDATA section state, remembering  
the original state as a return state, and "]]>" would transition back  
to the return state. (This doesn't help with speculative token streams,  
though. However, I'm currently optimistic that not having a  
speculative token stream at all might be feasible, but I don't have  
data yet. Working on it...)
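
The transitions just described can be sketched roughly as follows. This
is a simplified illustration, not the Validator.nu code: it covers only
the content after a <script> start tag, matches the end tag literally
and case-sensitively, and takes an inForeign parameter as an assumed
stand-in for asking the tree builder whether it is in foreign content.
The function and state names are made up for the example.

```javascript
// Forward-only scan of the content following a <script> start tag.
// Returns the script text and whatever input follows the end tag.
function tokenizeScriptContent(input, inForeign) {
  const OPEN = "<![CDATA[";
  const CLOSE = "]]>";
  const END_TAG = "</script>";
  let state = "cdata";     // the CDATA (raw text) state
  let returnState = null;  // single, non-stacked return state
  let text = "";
  let i = 0;
  while (i < input.length) {
    if (state === "cdata") {
      if (inForeign && input.startsWith(OPEN, i)) {
        // Enter the CDATA section state; the marker itself is dropped.
        returnState = state;
        state = "cdata-section";
        i += OPEN.length;
      } else if (input.startsWith(END_TAG, i)) {
        return { text: text, rest: input.slice(i + END_TAG.length) };
      } else {
        text += input[i];
        i += 1;
      }
    } else {
      // In the CDATA section state only "]]>" is special, so an end
      // tag inside the section is treated as ordinary text.
      if (input.startsWith(CLOSE, i)) {
        state = returnState;
        returnState = null;
        i += CLOSE.length;
      } else {
        text += input[i];
        i += 1;
      }
    }
  }
  return { text: text, rest: "" };
}

const sample = '<![CDATA[x = "</script>";]]></script><p>';
const asSvgScript = tokenizeScriptContent(sample, true);
const asHtmlScript = tokenizeScriptContent(sample, false);
console.log(asSvgScript, asHtmlScript);
```

In the foreign case the markers are stripped and the "</script>" inside
the section stays part of the script text; in the HTML case the markers
are plain text and the first "</script>" ends the element.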

Thus, CDATA sections in <script> or <style> would only serve to hide  
</script> (or </style>) inside them. (Can anyone think of a security  
problem with this given semi-bogus naïve legacy XSS gatekeepers?) I  
want to keep CDATA section entry for foreign data state, because it  
enables more gracefully degrading SVG subtrees that contain text.

The Validator.nu tokenizer already has the concept of a single (i.e.  
not stacked) return state variable in the tokenizer, so  
infrastructurally this wouldn't be a big deal for me to implement.

I think the biggest risk of doing this is that it would create an RSS  
<title>-like situation between CDATA sectionless SVG-in-XML <script>  
and <style> and SVG-in-text/html <script> and <style>. However, the  
current draft creates an RSS <title>-like situation within text/html,  
which could well be considered worse.

> I think in general you'd be pretty lucky if you didn't have to  
> modify scripts in SVG when pasted into text/html, so requiring  
> authors to remove the CDATA strings or prepend them with // isn't  
> too much to ask for, IMHO.

Having to change the scripting logic for the new context is a more  
intuitive requirement than having to know the gory details of CDATA  
tokenization magic, though.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 31 March 2009 09:31:44 UTC