Re: CDATA, Script, and Style from Henri Sivonen on 2009-04-01 (public-html@w3.org from April 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Wed, 1 Apr 2009 10:17:32 +0300
To: Jonas Sicking <jonas@sicking.cc>
Cc: Simon Pieters <simonp@opera.com>, Doug Schepers <schepers@w3.org>, HTML WG <public-html@w3.org>, www-svg@w3.org
Message-Id: <243209C3-844A-48E8-A611-5A315DBB4080@iki.fi>
On Apr 1, 2009, at 01:08, Jonas Sicking wrote:

> On Tue, Mar 31, 2009 at 2:30 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
>> On Mar 25, 2009, at 16:24, Simon Pieters wrote:
>>
>>> On Thu, 19 Mar 2009 18:52:25 +0100, Jonas Sicking <jonas@sicking.cc> wrote:
>>>
>>>> My feelings on 1 vs. 2 are:
>>>>
>>>> Problems with 1:
>>>> Parsing <![CDATA[]]> inside a CDATA element "feels" weird.
>>
>> I agree that it feels weird.
>>
>> I think the biggest problem with this entire issue is that the difference
>> between HTML <script> and <script> in XML is surprising and unintuitive, so
>> we will have a surprise boundary somewhere no matter what. It seems on the
>> general level we have the following options:
>>
>>  1) Have the surprise boundary between text/html and XML. (The situation
>> before SVG-in-text/html)
>>
>>  2) Have the surprise boundary between HTML <script> in text/html and
>> everything else. (The situation with SVG-in-text/html as drafted)
>>
>>  3) Have graded surprises with two boundaries:
>>    a) Have a surprise boundary between HTML <script> and SVG-in-text/html
>> <script> and another between SVG-in-text/html <script> and XML.
>>    b) Have a surprise boundary between pre-HTML5 <script> and HTML5
>> text/html <script>s and another between text/html and XML.
>>
>> I'm worried about escaping surprises in general having seen the RSS <title>
>> epic fail.
>
> I'm a little unclear as to what the behaviors in 3 are. I.e. which
> parsing/processing algorithms would lead to the two scenarios you
> describe?

3a would be:
HTML <script> in text/html is CDATA as it has always been. <![CDATA[ is
not special there. SVG <script> in text/html would be CDATA, except
<![CDATA[ would be special.

3b would be:
HTML <script> and SVG <script> in text/html are CDATA, except
<![CDATA[ would be special in both.
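
To make the difference between 3a and 3b concrete, here is a minimal
sketch of the decision as a predicate. The function name, the option
labels and the idea of keying on the element's namespace are assumptions
made for illustration only; nothing like this appears in the draft.

```python
# Namespace URIs for HTML and SVG elements.
HTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"


def cdata_marker_is_special(option, script_namespace):
    """Return True if <![CDATA[ is special inside a text/html <script>.

    `option` is "3a" or "3b" as described above (hypothetical labels);
    `script_namespace` is the namespace of the <script> element.
    """
    if option == "3a":
        # Only SVG <script> treats <![CDATA[ as special.
        return script_namespace == SVG_NS
    if option == "3b":
        # Both HTML <script> and SVG <script> treat it as special.
        return script_namespace in (HTML_NS, SVG_NS)
    raise ValueError("unknown option: %r" % (option,))
```

Under 3a the surprise sits between HTML <script> and SVG <script>; under
3b it moves to the boundary between pre-HTML5 and HTML5 handling of HTML
<script>.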

> I'm also unclear as to what behavior you are proposing.

I was tentatively proposing that <![CDATA[ behaviors (if any) be
handled as tokenizer states so that any given <![CDATA[ ... ]]>
section parses as it would parse in XML.

This doesn't help at all with eliminating the feedback from the tree
builder into the tokenizer. It was meant to smooth copy-paste between
different parts of a text/html document while keeping
<![CDATA[ ... ]]> consistent with XML.
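
As a rough illustration of what "handled as tokenizer states" could
mean, here is a minimal forward-only sketch with two states. It is not
the HTML5 tokenizer: the state names, the token tuples and the handling
of an unterminated section are all assumptions, and it uses substring
search rather than character-by-character transitions for brevity.

```python
from enum import Enum, auto

CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


class State(Enum):
    RAW_TEXT = auto()       # ordinary script/style text
    CDATA_SECTION = auto()  # between <![CDATA[ and ]]>


def tokenize_raw_text(text):
    """Split text into ("text", s) and ("cdata", s) tokens.

    The scan only moves forward; nothing already emitted is revisited.
    """
    tokens = []
    state = State.RAW_TEXT
    pos = 0
    while pos < len(text):
        if state is State.RAW_TEXT:
            start = text.find(CDATA_OPEN, pos)
            if start == -1:
                tokens.append(("text", text[pos:]))
                break
            if start > pos:
                tokens.append(("text", text[pos:start]))
            pos = start + len(CDATA_OPEN)
            state = State.CDATA_SECTION
        else:
            end = text.find(CDATA_CLOSE, pos)
            if end == -1:
                # Assumption: an unterminated section runs to the end.
                tokens.append(("cdata", text[pos:]))
                break
            tokens.append(("cdata", text[pos:end]))
            pos = end + len(CDATA_CLOSE)
            state = State.RAW_TEXT
    return tokens
```

Because each section is recognized on its own, several sections in a
row come out as several "cdata" tokens, which is how an XML parser would
see them.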

> How do you
> feel about my proposal in
>
> http://lists.w3.org/Archives/Public/public-html/2009Mar/0634.html
>
> It would result in a graded surprise where there's some change between
> HTML <script> parsing between HTML4 and HTML5, and some surprise in
> the boundary between SVG-in-HTML and SVG-in-XML.

If this happened in the parser, it would result in <![CDATA[ ... ]]>  
in text/html parsing differently from both XML and previous text/html  
behavior. I think that could be confusing to authors who try to form a  
coherent mental model of the languages they are working with.

However, if <![CDATA[ ... ]]> remains in the DOM and is only stripped
from the data in the JavaScript parser or the CSS parser, I suppose
that model could count as coherent with the way <!-- --> is currently
treated for script and style in text/html.
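
A minimal sketch of that second model, assuming the stripping is done by
a hypothetical helper that sits between the DOM and the script or style
engine. The helper name and the choice to unwrap only complete
<![CDATA[ ... ]]> pairs are assumptions, not anything specified in the
proposal.

```python
import re

# Matches one complete <![CDATA[ ... ]]> section, non-greedily, so that
# several sections in the same text node are each unwrapped separately.
_CDATA_SECTION = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def text_for_engine(dom_text):
    """Return the text handed to the JavaScript or CSS parser.

    The DOM text node itself is left untouched; only the copy passed to
    the engine has the CDATA markers removed.
    """
    return _CDATA_SECTION.sub(r"\1", dom_text)
```

Unwrapping whole pairs rather than deleting every "]]>" leaves script
text such as a[b[0]]>c alone when no <![CDATA[ precedes it, and it copes
with the multi-section <style> example quoted further down in this
message.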

>>>> Problems with 2:
>>>> Just stripping a leading and trailing "<![CDATA[" / "]]>" would break
>>>> markup like:
>>>> <style>
>>>> <![CDATA[
>>>> rect { fill: yellow; }
>>>> ]]>
>>>> <![CDATA[
>>>> circle { fill: blue; }
>>>> ]]>
>>>> </style>
>>>>
>>>> which probably happens occasionally due to copy-n-pasting.
>>
>> I don't like this, because it requires going back and modifying buffers that
>> had been already built instead of just tweaking forward-only tokenizer state
>> transitions, and it doesn't even work in the case where there are multiple
>> CDATA sections as shown above. If we end up doing something other than
>> what's currently in the draft, I'd much rather have what Simon proposes
>> as #4.
>
> The stripping doesn't happen at a tokenizer stage. It happens after
> all parsing is done when the inline data is taken from the DOM and
> passed to the serializer.

Do you mean passed to the script engine? So the string "<![CDATA["  
would appear in the content of the text node in the DOM? I initially  
thought you meant removing "<![CDATA[" and "]]>" in the tree builder.

What about <![CDATA[ in SVG subtrees outside <script> and <style>?  
It's useful for graceful degradation but still involves feedback to  
the tokenizer unless supported anywhere outside foreign content as well.

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Wednesday, 1 April 2009 07:18:20 UTC