- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Tue, 31 Mar 2009 12:30:56 +0300
- To: Simon Pieters <simonp@opera.com>
- Cc: "Jonas Sicking" <jonas@sicking.cc>, "Doug Schepers" <schepers@w3.org>, "HTML WG" <public-html@w3.org>, www-svg@w3.org
On Mar 25, 2009, at 16:24, Simon Pieters wrote:

> On Thu, 19 Mar 2009 18:52:25 +0100, Jonas Sicking <jonas@sicking.cc>
> wrote:
>
>> My feelings on 1 vs. 2 is:
>>
>> Problems with 1:
>> Parsing <![CDATA[]]> inside a CDATA element "feels" weird.

I agree that it feels weird.

I think the biggest problem with this entire issue is that the difference between HTML <script> and <script> in XML is surprising and unintuitive, so we will have a surprise boundary somewhere no matter what.

It seems on the general level we have the following options:

1) Have the surprise boundary between text/html and XML. (The situation before SVG-in-text/html)
2) Have the surprise boundary between HTML <script> in text/html and everything else. (The situation with SVG-in-text/html as drafted)
3) Have graded surprises with two boundaries:
   a) Have a surprise boundary between HTML <script> and SVG-in-text/html <script> and another between SVG-in-text/html <script> and XML.
   b) Have a surprise boundary between pre-HTML5 <script> and HTML5 text/html <script>s and another between text/html and XML.

I'm worried about escaping surprises in general having seen the RSS <title> epic fail.

>> Parsing for
>> CDATA has remained largely the same since the dawn of human kind
>> (well, the particular branch of human kind that supports SGML). But
>> the bigger problem with supporting <!CDATA[]]> inside <script> is
>> that
>> it'd break existing HTML content like:
>> <script>
>> x = "<res><![CDATA[if a < b < c then they are sorted]]></res>";
>> var parser = new DOMParser();
>> var doc = parser.parseFromString(x, "text/xml");
>> xhr = new XMLHttpRequest();
>> xhr.open("POST", uri);
>> xhr.send(doc);
>> </script>

This argues for preferring any of the options over 3b.

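To make the breakage in the quoted script concrete, here is a small sketch. It assumes a text/html tokenizer that strips CDATA markers inside HTML <script>, which is one way option 3b could behave; `stripCdataMarkers` is a hypothetical stand-in for that behavior and is not taken from any draft or implementation.

```javascript
// Hypothetical model of option 3b: CDATA markers inside an HTML <script>
// are recognized by the tokenizer and stripped before the script runs.
function stripCdataMarkers(scriptText) {
  return scriptText.split("<![CDATA[").join("").split("]]>").join("");
}

// The first line of the script from the quoted example, as the author wrote it.
const authored =
  'x = "<res><![CDATA[if a < b < c then they are sorted]]></res>";';

// What the script engine would receive under the assumed behavior.
const seenByScriptEngine = stripCdataMarkers(authored);

// The string literal has lost its CDATA section, so the text later handed to
// DOMParser contains bare "<" characters and is no longer well-formed XML.
console.log(seenByScriptEngine);
```

Under this assumption the script still runs, but the string it parses as XML is no longer the one the author wrote.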
>> Problems with 2:
>> Just stripping a heading and trailing "<![CDATA[" / "]]>" would break
>> markup like:
>> <style>
>> <![CDATA[
>> rect { fill: yellow; }
>> ]]>
>> <![CDATA[
>> circle { fill: blue; }
>> ]]>
>> </style>
>>
>> which probably happens occasionally due to copy-n-pasting.

I don't like this, because it requires going back and modifying buffers that had already been built instead of just tweaking forward-only tokenizer state transitions, and it doesn't even work in the case where there are multiple CDATA sections as shown above.

If we end up doing something other than what's currently in the draft, I'd much rather have what Simon proposes as #4.

> (3) Have a "dirty" flag that's initially false and is set to true
> when you see non-whitespace other than the string "<![CDATA[", which
> is stripped and sets the insertion mode to "in CDATA section in
> CDATA element" which eats the next "]]>" and switches back to the
> previous insertion mode and resets the dirty flag.
>
> However this still wouldn't handle stuff like <script><![CDATA[x =
> "<res><script></script></res>"]]></script> (which the
> <script><!--...--></script> syntax supports).
>
> Also, should we support CDATA sections in RCDATA elements? Should
> they make entities not be expanded there?
>
> (4) Make <![CDATA and ]]> equivalent to <!-- and --> in (R)CDATA,
> except that they are stripped.

I think doing this for SVG <script> but not HTML <script> would be my preferred way of implementing option 3a (according to my numbering :-) of graded surprise.

The "<![CDATA[" in the DATA and CDATA states would transition to the CDATA section state and remember the original state as a return state if the tree builder is in foreign, and "]]>" would transition back to the return state. (This doesn't help with speculative token streams, though. However, I'm currently optimistic that not having a speculative token stream at all might be feasible, but I don't have data yet. Working on it...)

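As a rough illustration of the return-state idea described above, here is a minimal sketch. It is not the Validator.nu code: `scanRawText`, the state names, the `inForeign` flag and the simplified end-tag match (no check of the character after the tag name) are all assumptions made for the example.

```javascript
// Hypothetical sketch of the CDATA state with a single return state.
// Scans the content of a <script>/<style> element and returns the text up to
// the end tag that closes it, plus whatever input follows that end tag.
function scanRawText(input, endTag, inForeign) {
  const OPEN = "<![CDATA[";
  const CLOSE = "]]>";
  let state = "cdata";      // tokenizer state for <script>/<style> content
  let returnState = null;   // single (not stacked) return state
  let text = "";            // only ever appended to, never rewritten
  let i = 0;
  while (i < input.length) {
    if (state === "cdata") {
      if (inForeign && input.startsWith(OPEN, i)) {
        returnState = state;          // remember where to come back to
        state = "cdataSection";
        i += OPEN.length;             // the marker itself is stripped
        continue;
      }
      if (input.slice(i, i + endTag.length).toLowerCase() === endTag) {
        return { text, rest: input.slice(i + endTag.length) };
      }
    } else if (input.startsWith(CLOSE, i)) {
      state = returnState;            // "]]>" goes back to the return state
      returnState = null;
      i += CLOSE.length;
      continue;
    }
    text += input[i];
    i += 1;
  }
  return { text, rest: "" };          // ran out of input before the end tag
}

// Two CDATA sections in an SVG <style>: both pairs of markers are dropped.
const styled = scanRawText("<![CDATA[a]]>\n<![CDATA[b]]></style>rest", "</style>", true);

// A CDATA section in an SVG <script> hides the "</script>" inside it.
const hidden = scanRawText('<![CDATA[x = "</script>";]]></script>tail', "</script>", true);

// An HTML <script> (not in foreign content) keeps the markers as plain text.
const legacy = scanRawText('x = "<![CDATA[a]]>";</script>', "</script>", false);
```

Because the scanner only moves forward and appends, it never has to go back and edit a buffer, and the multiple-section <style> case falls out of the same two transitions.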
Thus, CDATA sections in <script> or <style> would only serve to hide </script> (or </style>) inside them. (Can anyone think of a security problem with this given semi-bogus naïve legacy XSS gatekeepers?)

I want to keep CDATA section entry for foreign data state, because it enables more gracefully degrading SVG subtrees that contain text.

The Validator.nu tokenizer already has the concept of a single (i.e. not stacked) return state variable in the tokenizer, so infrastructurally this wouldn't be a big deal for me to implement.

I think the biggest risk of doing this is that it would create an RSS <title>-like situation between CDATA sectionless SVG-in-XML <script> and <style> and SVG-in-text/html <script> and <style>. However, the current draft creates an RSS <title>-like situation within text/html, which could well be considered worse.

> I think in general you'd be pretty lucky if you didn't have to
> modify scripts in SVG when pasted into text/html, so requiring
> authors to remove the CDATA strings or prepend them with // isn't
> too much to ask for, IMHO.

Having to change the scripting logic for the new context is a more intuitive requirement than having to know the gory details of CDATA tokenization magic, though.

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Tuesday, 31 March 2009 09:31:44 UTC