Re: CDATA, Script, and Style from Jonas Sicking on 2009-04-06 (public-html@w3.org from April 2009)

From: Jonas Sicking <jonas@sicking.cc>
Date: Mon, 6 Apr 2009 13:45:16 -0700
To: Henri Sivonen <hsivonen@iki.fi>
Cc: Simon Pieters <simonp@opera.com>, Doug Schepers <schepers@w3.org>, HTML WG <public-html@w3.org>, "www-svg@w3.org" <www-svg@w3.org>
Message-ID: <63df84f0904061345j323d0709n774d7ed638f9e977@mail.gmail.com>

On Mon, Apr 6, 2009 at 11:54 AM, Henri Sivonen <hsivonen@iki.fi> wrote:
>
> On Apr 1, 2009, at 10:37, Jonas Sicking wrote:
>
>>>>>>> Problems with 2:
>>>>>>> Just stripping a leading and trailing "<![CDATA[" / "]]>" would break
>>>>>>> markup like:
>>>>>>> <style>
>>>>>>> <![CDATA[
>>>>>>> rect { fill: yellow; }
>>>>>>> ]]>
>>>>>>> <![CDATA[
>>>>>>> circle { fill: blue; }
>>>>>>> ]]>
>>>>>>> </style>
>>>>>>>
>>>>>>> which probably happens occasionally due to copy-n-pasting.
>>>>>
>>>>> I don't like this, because it requires going back and modifying buffers
>>>>> that had already been built instead of just tweaking forward-only
>>>>> tokenizer state transitions, and it doesn't even work in the case where
>>>>> there are multiple CDATA sections as shown above. If we end up doing
>>>>> something other than what's currently in the draft, I'd much rather have
>>>>> what Simon proposes as #4.
>>>>
>>>> The stripping doesn't happen at a tokenizer stage. It happens after
>>>> all parsing is done when the inline data is taken from the DOM and
>>>> passed to the serializer.
>>>
>>> Do you mean passed to the script engine?
>>
>> Yes, thanks.
>>
>>> So the string "<![CDATA[" would appear in the content of the text node in
>>> the DOM?
>>
>> Yes
>
> If "<![CDATA[" ends up in the DOM, I think the end result could be made more
> robust if the operation of handing DOM data to the CSS or JS parser didn't
> try to drop "<![CDATA[" and "]]>" but instead the JS and CSS parser were
> changed to treat those strings as comments, i.e. like "/* */". This way,
> they wouldn't be dropped from within potentially existing string literals.
>
> This approach would cause notable leakage of the SVG-in-text/html feature
> into other parts of a browser engine, though, which isn't very nice.

Indeed. I think this would be unfortunate, but definitely a
possibility. Given that '<!--' and '-->' are discarded by the same
engines then

I do think that if we just strip "<![CDATA[" from the beginning of the
contents and "]]>" from the end, we wouldn't need to leak SVG-in-HTML
into other parts of the engine, as outlined in

http://lists.w3.org/Archives/Public/public-html/2009Mar/0241.html
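
A minimal sketch of that stripping step, applied to the text taken from the
DOM just before it is handed to the script or style engine. The function name
is hypothetical, and ignoring surrounding whitespace and stripping each marker
independently are assumptions, not something the proposal specifies:

```python
CDATA_OPEN = "<![CDATA["
CDATA_CLOSE = "]]>"


def strip_cdata_wrapper(text: str) -> str:
    """Drop one leading "<![CDATA[" and one trailing "]]>" from inline data.

    Assumption: whitespace around the markers is ignored, and each marker is
    stripped on its own even if the other one is missing.
    """
    body = text.strip()
    if body.startswith(CDATA_OPEN):
        body = body[len(CDATA_OPEN):]
    if body.endswith(CDATA_CLOSE):
        body = body[:-len(CDATA_CLOSE)]
    return body


# Only the outermost markers are touched; a marker inside a string literal
# in the middle of the script is left alone.
print(strip_cdata_wrapper("<![CDATA[ scriptHere() ]]>"))
print(strip_cdata_wrapper('var s = "<![CDATA[";'))
```

With the two-section <style> example quoted earlier, this sketch removes only
the first "<![CDATA[" and the last "]]>", so the inner "]]>" and "<![CDATA["
still reach the CSS parser.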

> Also, I'm a bit concerned that letting "<![CDATA[" and "]]>" reach the DOM
> would result in those strings being escaped as "&lt;![CDATA[" and "]]&gt;"
> if serialized to XML, so going back and forth a couple of times through a real
> serializer and via copying and pasting would result in some ugly cruft.

We don't seem to have this problem right now with <script><!--
scriptHere() --></script> so I'm not sure there's a reason to think it
will be more of a problem with <script><![CDATA[ scriptHere()
]]></script>.
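
To make the comparison concrete, here is a rough sketch of what a generic XML
text escaper does to both forms. It uses Python's standard
xml.sax.saxutils.escape as a stand-in for a browser's XML serializer, which is
an assumption; serialize_text_node is a hypothetical helper name:

```python
from xml.sax.saxutils import escape


def serialize_text_node(text: str) -> str:
    """Escape text node data the way a generic XML serializer would.

    escape() replaces "&", "<" and ">" with the corresponding entities.
    """
    return escape(text)


cdata_script = "<![CDATA[ scriptHere() ]]>"
comment_script = "<!-- scriptHere() -->"

print(serialize_text_node(cdata_script))    # &lt;![CDATA[ scriptHere() ]]&gt;
print(serialize_text_node(comment_script))  # &lt;!-- scriptHere() --&gt;
```

Both wrappers are escaped in the same way, so in this sketch the CDATA form
picks up no more cruft than the comment form does.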

>>> What about <![CDATA[ in SVG subtrees outside <script> and <style>? It's
>>> useful for graceful degradation but still involves feedback to the tokenizer
>>> unless supported anywhere outside foreign content as well.
>>
>> I think that is mostly an orthogonal issue. But I would like <![CDATA[ ]]>
>> to be parsed as in XML both in foreign content mode and in normal mode,
>> to keep things consistent.
>
> I think it's relevant in two ways:
>
> 1) If the syntax behaves as in XML outside <script> and <style> but not as
> in XML inside <script> and <style>, the result may be confusing.
>
> 2) Having CDATA sections that behave like XML CDATA sections in HTML5
> parsers but like bogus comments in earlier browsers is useful for hiding SVG
> text from old browsers for graceful degradation. However, if this syntax
> causes feedback from the tree builder to the tokenizer, we haven't managed
> to completely eliminate the (non-trivial) feedback to the tokenizer meaning
> the other efforts to do so wouldn't be very useful.

I think allowing <![CDATA[ ]]> everywhere we parse PCDATA and
RCDATA would be nice, both from an authoring point of view, since it
makes for more consistency, and from an implementation point of
view, because it's basically the only remaining feedback from the
parser to the tokenizer, isn't it?

/ Jonas

Received on Monday, 6 April 2009 20:46:08 UTC